Even though this topic may be a bit polarizing, especially among people who are new to data science, knowing more about it can be very useful, particularly if you value a sense of perspective more than a good grade in some data science crash course. The latter is bound to overemphasize either Stats or AI, depending on the instructor's knowledge and experience. However, some data science professionals, myself included, prefer a more balanced approach to the topic. This is why I decided to make this video, which is now available on Safari for your viewing.
Note that this is by no means a complete tutorial on the topic, but it is a good overview of the various aspects of statistics related to data science, along with some programming resources in both Python and Julia, to get you started. Enjoy!
Recently I decided to spice things up a bit and experiment with a fresher approach to videos. As a result, I played around with graphics more, in an effort to achieve a more intuitive presentation of the topic I looked at, namely sampling (check out the video here). Not all the videos that follow are going to be like that, but I’m definitely going to look into more interesting ways of tackling the graphics part.
This kind of video production takes a lot of work though, and as I haven’t done graphic design in years, I’m a bit rusty, so such a project takes a considerable amount of time. At the same time, I need to keep promoting my work online, and one of the strategies I’ve found quite effective is through articles on beBee. As a result, I won’t be posting articles on my blog as often. However, if someone is interested in contributing to it, I’d be happy to consider guest authors on Data Science, A.I., Cyber-security, Programming, and other relevant topics.
A.I. and ML are often used interchangeably, while many people consider one to be a subset of the other (which one is the bigger set depends on who you ask). However, things may not be as clear-cut as they seem, since the communities of these two fields are not all that related, and there is even a sort of rivalry among the hard-core members of each one. Why is that, though, if A.I. and ML are so similar to each other, enough to confuse even data scientists?
First of all, let’s start with some definitions. A.I. is the group of methods, algorithms, and processes that bring about computer systems that emulate human intelligence, even if the intelligence they usually exhibit is quite different from our own. These systems often take the form of self-sufficient machines, such as robots, as well as agent programs that roam the Internet or cyberspace in general. ML, on the other hand, is the group of methods, algorithms, and processes that bring about computer systems that solve some data analytics problem in an efficient manner, through some training procedure (the learning part of machine learning). This training can take place with the help of some specific outcomes (aka targets) or without them. The training can also take the form of feedback on the system’s predictions, which is like on-the-job training of sorts.
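To make the two training modes mentioned above concrete, here is a minimal, hypothetical Python sketch: a supervised learner that uses targets, and an unsupervised one that finds structure in the data alone. These are toy 1-D examples of my own devising, not production ML methods.

```python
def train_supervised(xs, ys):
    """Learn from data paired with targets: fit a threshold classifier
    that separates class 0 from class 1 (1-D data)."""
    # The midpoint between the two class means serves as the decision boundary.
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x > threshold)


def train_unsupervised(xs, steps=10):
    """Learn without targets: naive 1-D 2-means clustering that discovers
    two groups in the data on its own."""
    centers = [min(xs), max(xs)]  # crude initial guesses
    for _ in range(steps):
        groups = ([], [])
        for x in xs:
            # Assign each point to the nearest of the two centers.
            groups[abs(x - centers[0]) > abs(x - centers[1])].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Supervised: heights with known labels (0 = short, 1 = tall).
clf = train_supervised([150, 155, 160, 180, 185, 190], [0, 0, 0, 1, 1, 1])
# Unsupervised: same data, no labels; the algorithm finds the two clusters.
centers = train_unsupervised([150, 155, 160, 180, 185, 190])
```

The first function cannot work without the targets `ys`, while the second never sees any; that difference is the essence of supervised versus unsupervised learning.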
Clearly, there is a close link between ML and data science, since ML systems are designed for this sort of problem. A.I. systems, on the other hand, may tackle other kinds of problems too (e.g. finding the optimal route given some restrictions). So, there is a part of A.I. that is leveraged in data science and a part of A.I. that has nothing to do with our craft. The part of A.I. that is used in data science has a large intersection with ML, mainly through network-based systems, such as ANNs. Lately, Deep Learning networks, which are specialized and more sophisticated kinds of ANNs, have become quite popular and are also part of that intersection between A.I. and ML.
Many people who work in A.I. consider it more of a science than ML, and they are right in a way. Most ML methods are heuristics-based and don’t have much theory behind them, while the ones that are tied to Stats (hybrids of Statistics and ML) are heavily constrained by the assumptions of the underlying statistical theory. A.I. methods are generally data-driven too, but they are also related to processes found in nature, so they don’t come out of the blue either.
Nevertheless, a data scientist who is professional and pragmatic doesn’t put too much emphasis on the differences between A.I. and ML methods, caring more about how they can be applied to solve the problems at hand. So, even if these two families of methods are not the same, nor is one a subset of the other, they are both very useful, if not essential, in practical data science.
Recently I made a new data science video, this time on Anomaly Detection. Also, I experimented with subtitles in this one, for those who may have a hard time understanding my accent. Feel free to let me know what you think, via this blog site. You can check out the vid on Safari here.
Note that this 18-minute video is an overview of the topic and although it provides some Math fundamentals and some Python resources, it is not an in-depth analysis of the topic. For the latter, you may need to consult a book or a tutorial on the subject. Cheers!
Recently a far-reaching scandal broke out, as a reporter exposed a data science company called Cambridge Analytica. According to the information gathered, the company used a dataset harvested via Facebook, enriched with a lot of data from the Facebook graph, to affect the 2016 presidential election in the USA. It is important to note that the role of that project was not exploratory (e.g. finding insights related to the voters); rather, it aimed at steering the voters’ views on a certain candidate, in order to benefit the other candidate, who was the company’s client.
Personally, I’m not vested in US politics and don’t have any strong views on the matter, which is why I chose to omit the names of the politicians involved. As a data science professional, however, I find what C.A. did shameful and unethical, on many levels. Examples like this only go to show that, just like everything else in applied science, data science can be used for malicious purposes too, something that every data scientist ought to be aware of and avoid whenever possible.
Also, a topic like this one concerns not just data scientists but anyone working alongside them, since it would be naive to believe that this whole fiasco was the result of a few data science professionals acting on their own. As the corresponding footage shows, the black-hat approach to data analytics was initiated by the company’s head, who was quite forthcoming about what the company was trying to do. That doesn’t make the data scientists working there innocent victims, but at least the responsibility for this dark project is shared among everyone there, not just them. Also, considering that it wasn’t a huge company, it’s quite unlikely that the data scientists were unaware of the unethical and immoral agenda their work was serving. However, it is clear that had they not cooperated with this plan, things would at the very least have slowed down.
So, how can we guard ourselves from situations like that of the C.A. scandal, as data science professionals? First of all, we can avoid working for people who don’t have a moral compass and who look at how the data products developed can be used to covertly drive certain behaviors that, if exposed, would be punishable. So, if the leaders of a project are shady individuals who don’t mind hurting others in order to make their clients happy, that’s a red flag.
The data itself could be another potential warning sign. If it is collected through unethical means and used in ways that compromise people’s privacy, then that’s a tell-tale sign that something fishy is going on. Another such sign is the kind of insights discovered through such a project (in this case, the categorization of the people involved into four groups that relate to some intimate aspects of their personalities). If we are not comfortable sharing these insights with those people (assuming there is no NDA in place prohibiting that), because it just feels wrong, then we shouldn’t be digging up those insights to start with.
Finally, if the data products don’t serve the people involved in the data behind these products, even indirectly, then that’s another red flag. The products we create should be something we can talk about openly (without giving away any sensitive know-how behind them, of course), without feeling ashamed or guilty about their purpose.
Naturally, these few suggestions are but the tip of the iceberg of a very large topic related to the Ethics aspect of our profession. I cannot hope to do this topic justice through a blog article, or even a video like the one I made on this topic last year. However, it’s good to remember that we are not powerless against the malicious use of data science by people who are either immoral or amoral, caring only for themselves at the expense of the well-being of others. We may not always be able to stop their agenda, but we can at least identify an unethical project and not contribute to it. Besides, there are many things we can do with data science, so why not focus on the more beneficial ones instead?
I understand that making predictions about these things is quite risky, but it’s good to take a stance on the things that matter, instead of playing it safe, like many tech “experts” out there do. Of course it’s easier to parrot the widely accepted views on every hot topic, gathering “likes” and positive comments, but no one has ever offered anything useful to the whole by being lukewarm.
First of all, I’m not making a case against cryptocurrencies as a possibility. In fact, I find them potentially immensely useful, especially in a country where the conventional currency is plagued by inflation and by the people mismanaging the economy around it. Cryptocurrencies can be a viable alternative to the official currency, should they be used instead of a problematic fiat currency. The reality of cryptocurrencies, however, is very distant from this idealistic scenario. In fact, I’ve yet to encounter one cryptocurrency that is actually used as a currency. Most of them are some form of speculative investment, like a stock, but without any inherent value. Let that sink in for a bit: cryptocurrencies themselves have no value whatsoever.
Someone may argue that conventional currencies have no inherent value either, and that’s a valid point. However, the value of conventional currencies doesn’t fluctuate wildly over time, since there are mechanisms to keep it somewhat stable. Naturally, there are exceptions, but even an unstable currency is generally more stable than the average cryptocurrency out there. The reason is simple: people who handle cryptocurrencies do so with one particular aim: to make money off them. They don’t care if these currencies disappear tomorrow, as long as they cash out first. It doesn’t take a financial genius to understand that this sort of ecosystem is not sustainable. The other reason is a bit more subtle, yet equally important. Most cryptocurrencies require someone to constantly work for them (a process known as mining), or to provide some sort of infrastructure that’s not cheap to maintain. This translates into a running cost, which may not seem like much individually, but collectively it is a lot, enough to make the whole system unsustainable. This is particularly true in cases like Bitcoin, where the computational problems that need to be solved to maintain the blockchain behind the cryptocurrency get progressively more challenging, and therefore more expensive. Once enough people realize that, the fascination with these cryptocurrencies may wane, especially if some regulating mechanism comes into place.
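To see why mining gets progressively more expensive, consider this toy proof-of-work sketch in Python. It is a simplified stand-in for Bitcoin's actual scheme (the function name and inputs are made up for illustration), but it captures the core idea: each additional leading zero required in the hash multiplies the expected number of attempts, and thus the energy cost, by 16.

```python
import hashlib

def mine(block_data, difficulty):
    """Search for a nonce such that sha256(block_data + nonce) starts with
    `difficulty` leading zero hex digits. Expected attempts: 16**difficulty."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1

# A difficulty of 2 takes ~256 attempts on average; every extra digit of
# difficulty multiplies the expected work by 16, with no upper bound.
nonce, digest = mine("some transactions", 2)
```

Verifying a solution takes a single hash, while finding one takes exponentially many; that asymmetry is exactly the running cost the paragraph above refers to.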
Artificial Intelligence, on the other hand, is a completely different animal. Even the most basic applications of it add value to whoever invests in them, be it someone tackling big data problems, or someone who just wants to optimize their technical infrastructure. Through a vast variety of ways, A.I. manages to add value to the people using it, particularly if they have developed it sufficiently, thereby automating certain expensive processes. That’s why people are amazed by it and spend hours speculating how it can help bring about numerous benefits to the world. Even if there are some inevitable pitfalls in this technology, if it is handled maturely, it can be of great benefit for the whole. Besides, as a scientific field it existed and flourished on its own, long before the futurists used it to promote their ideology, or before it became mainstream.
Hopefully it won’t be long before the cryptocurrency craze subsides and the people who waste their time and energy on it focus their efforts on something more sustainable, something that adds value to its environment rather than draining resources and time. Perhaps this could be A.I., or some other similar technology. Whatever the case, the cryptocurrencies that are around today have an expiration date, whether people are willing to accept that or not...
Happy π day everyone! π (usually pronounced as "pie") is a very important constant in Mathematics and although it's not so relevant to this blog, I decided to celebrate it through this video. What does a passwords vid have to do with π? Well, it's not just about Passwords, since it also covers Information, and Entropy. This is why it was code named PIE during its production last week. To be honest, I wasn't expecting that it would be published on the actual π day, but stranger things have happened!
Anyway, if you want to learn more about this fascinating topic through a fairly lightweight video full of useful tips on how to evaluate passwords and how to create strong yet memorable ones, this is the video for you. Check it out on Safari when you have a moment!
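As a small taste of the entropy part, here is a rough Python sketch of the standard back-of-the-envelope estimate for password strength: length times log2 of the character-pool size. The function name and the pool heuristics are my own, not taken from the video.

```python
import math
import string

def entropy_bits(password):
    """Rough password entropy estimate: length * log2(size of character pool).
    The pool is inferred from the character classes actually used."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(c in string.punctuation for c in password):
        pool += len(string.punctuation)  # 32 printable ASCII symbols
    return len(password) * math.log2(pool) if pool else 0.0

# "password" uses only lowercase letters: 8 * log2(26), about 37.6 bits.
# A longer password drawing on all four classes scores far higher, which is
# why length plus variety beats mere obscurity.
```

Note that this is an upper bound: a dictionary word like "password" is far weaker in practice than its raw bit count suggests, since attackers try common words first.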
Questions like this one are both shallow and pointless, on many levels. It is like asking what kind of classical guitar it is best to use for playing scales. Obviously, if you want to practice scales, it doesn’t matter what kind of guitar you use. You play the scales to improve your technique. As for playing a musical piece, if you are a good guitarist, you can play that piece well even with a cheap guitar.
Of course, the matter of the OS is more polarizing than that of a guitar, which is why people have a hard time seeing the merit of another OS. Also, for many people the OS they use is part of their identity, much like people who support a particular sports team so fervently that they are willing to get violent towards the supporters of another team. The difference in this case is that the violence usually takes place over the internet, in the form of passive-aggressive comments and insults.
Data science is a field of science focusing mainly on applications. As a result, a data science professional is more concerned about the way she works with the data at hand to turn it into something useful. More often than not, this involves some predictive analytics model too, which she has to train, test, and fine-tune. All of this she does without any concern about the OS used, since the programming language she works with is cross-platform. Also, if she can work with that programming language well, she won’t have any issues with shell scripting in a particular OS, which is fairly simple by comparison.
Now, some OSes are faster than others, so someone may prefer one of those, even if this usually comes at a cost, namely the lack of user-friendliness that a faster OS tends to entail. Whatever the case, if the code created is done well, it’s bound to be fast even on a slower OS. Also, if the code needs to run for a long time, then it’s probably better off running on a computer cluster or in the cloud, so the OS you use on your own computer is not that important.
So, if you are comfortable with OS X and can do your data science work there efficiently, that’s all that matters. Let the people who have nothing else to do argue about which OS is better or worse for data science. If those people weren’t arguing about this trivial matter, they’d probably be arguing about which soft drink is better, or which sports team should win the championship. As a data scientist, you have better things to do than waste time talking with them, since it’s unlikely they’ll ever change their minds anyway.
After investigating this topic quite a bit, as I was looking into A.I. stuff, I decided to create a video on it. To make it more complete, I included other methods too, such as Statistics-based and heuristics-based ones. Despite the large amount of content I put into this project (the script was over 4000 words), I managed to keep the video at a manageable length (a bit less than half an hour). Check it out on Safari when you have some time!
Distributed Ledger Technology (DLT) is a modern development in IT that enables the formation of consensus across a distributed network, within a reasonable amount of time. This decentralized approach to computing has a variety of real-world applications, such as crypto-currencies, automated contracts, etc. Most people think of Blockchain as the best DLT out there, but lately there has been a new player in this field, called Hashgraph, which is quite promising. Let’s take a closer look at it.
Hashgraph is a new approach to DLT that promises to be better than Blockchain and its variants. Right now it is still in its early stages, and to many people it seems somewhat academic, but it’s quite mature as a technology, since the company that developed it already has industrial clients. Also, Dr. Leemon Baird, who came up with the idea and co-founded the company, has been working on it for a few years now, so it’s not some trendy idea that is bound to disappear in a few months. You can learn more about the specifics of its functionality in this comprehensive deck on it, which demonstrates how an extension of the gossip protocol is employed to greatly speed up voting among the nodes (in what is called “virtual voting”), driving consensus among them in an efficient and safe manner.
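To build some intuition for why gossip spreads information so quickly across a network (the property that virtual voting relies on), here is a toy push-pull gossip simulation in Python. This is a rough sketch of the spreading dynamics only, written by me for illustration; it is not hashgraph's actual algorithm.

```python
import random

def rounds_to_full_spread(n_nodes, seed=0):
    """Toy gossip simulation: node 0 starts with a piece of information.
    Each round, every node syncs with one random peer, after which both
    know whatever either of them knew. Returns the number of rounds
    until every node knows the information."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    knows = [False] * n_nodes
    knows[0] = True
    rounds = 0
    while not all(knows):
        rounds += 1
        for i in range(n_nodes):
            j = rng.randrange(n_nodes)  # pick a random gossip partner
            if knows[i] or knows[j]:
                knows[i] = knows[j] = True  # both sides end up informed
    return rounds

# The number of informed nodes roughly doubles each round, so full spread
# takes on the order of log2(n_nodes) rounds rather than n_nodes rounds.
```

This logarithmic spread is what makes gossip-based systems scale so well: even for thousands of nodes, only a handful of rounds are needed before everyone has everything.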
The key advantage of this tech is that it’s quite fast. Not two or three times faster than the best alternative, but orders of magnitude faster. The reason for this is that it has a lower computational cost, plus there is no need to wait for potential disagreements to resolve themselves (a problem in Blockchain, related to forking).
Another advantage is that hashgraph can guarantee fairness. That’s not a marketing-related point, but something mathematically proven, deriving from the fact that the knowledge of the system that reaches any particular node converges over time to the total information of that system. Fairness is linked to consensus, and it is essential for a network where the first-come-first-served rule can apply.
Also, this system is pretty secure, since it has what is known as Byzantine Fault Tolerance. This is the condition whereby all the members of the network know when they have reached consensus, with no room for doubt about it. This level of security ensures that certain network attacks, like Distributed Denial of Service (DDoS), cannot harm a hashgraph network. Moreover, this system doesn’t require proof-of-work, like many Blockchain-based systems do, something that brings us to the next point.
Finally, hashgraph is lightweight, requiring very little power to run. This makes it very scalable, as it can run on any kind of device (e.g. a smartphone), and sustainable, since it requires a manageable amount of energy as it scales. This last point is not mentioned on the hashgraph site, but it’s a very important one, since it can make the difference between a system that is good in theory and one that is highly functional in practice.
Also not mentioned on the site are a couple of disadvantages of hashgraph. First of all, the whole technology is patented. This means that
1. not everyone can tweak its code (though there is an SDK available if you want to build an application on top of the hashgraph infrastructure), and
2. there is no guarantee that this tech is going to remain free in the future.
What’s more, hashgraph doesn’t have a crypto-currency yet (even though there are scammers who will try to sell you hashgraph coins). This makes hashgraph seem less applicable to most people, who have already become familiar with alternative currencies like Bitcoin and Ripple. Also, a crypto-currency seems like an obvious application of a DLT, so not having one based on hashgraph raises questions about how seriously it intends to compete with Blockchain.
Overall, hashgraph seems like a very promising kind of DLT, perhaps one that can add a lot of value to the digital world. It has become clear that despite the various shortcomings of DLTs, the decentralization options they offer are something the world needs, even if it may take a few years before this mindset becomes mainstream. Hashgraph still has some issues and may not be able to win everyone over, but just like the several closed-source technologies out there that have their share of users, hashgraph may be able to earn a piece of the pie of cyber products and services, such as automated contracts, easy online collaboration without the need for servers, fast crypto-currencies, etc.
How Is This Relevant to Data Science / A.I.?
Since most breakthroughs come about from the combination of different technologies, it would be naive to think that data science or A.I. will evolve on their own. New technologies like DLTs, IoT, and Cyber-security have a role to play, since they come with their own sets of data streams and data problems that need to be solved. Perhaps the next generation of data products will run as applications on a hashgraph instead of a cloud, with the latter used just to store the most sensitive data. Whatever the case, it’s good to look at the bigger picture of the tech world when contemplating data science and artificial intelligence, since the more the various tech fields evolve, the more inter-connected they become.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.