Happy π day, everyone! π (usually pronounced "pie") is a very important constant in mathematics and, although it's not so relevant to this blog, I decided to celebrate it through this video. What does a passwords vid have to do with π? Well, it's not just about Passwords, since it also covers Information and Entropy. This is why it was code-named PIE during its production last week. To be honest, I wasn't expecting it to be published on the actual π day, but stranger things have happened!
Anyway, if you want to learn more about this fascinating topic through a fairly lightweight video full of useful tips on how to evaluate passwords and how to create strong yet memorable ones, this is the video for you. Check it out on Safari when you have a moment!
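For a taste of the Entropy part, here is a minimal sketch in Julia of the standard back-of-the-envelope entropy estimate for randomly generated passwords (the function name is mine, not something from the video):

```julia
# Entropy in bits of a password drawn uniformly at random from a character
# pool: H = L * log2(C), where L is the length and C is the pool size.
password_entropy(len::Integer, poolsize::Integer) = len * log2(poolsize)

password_entropy(12, 62)  # 12 chars over [a-zA-Z0-9] -> ~71 bits
```

Bear in mind that this estimate only applies to passwords drawn uniformly at random; human-chosen passwords carry considerably less entropy than their length suggests.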
Questions like this one are both shallow and pointless, on many levels. It is like asking what kind of classical guitar is best for playing scales. Obviously, if you want to practice scales, it doesn’t matter what kind of guitar you use; you play the scales to improve your technique. As for playing a musical piece, if you are a good guitarist, you can play that piece well even with a cheap guitar.
Of course, the matter of the OS is more polarizing than that of a guitar, which is why people have a hard time seeing the merit of another OS. Also, for many people the OS they use is part of their identity, much like the sports team they support, to the point where some are willing to get violent towards the supporters of another team. The difference in this case is that the violence usually takes place over the internet and takes the form of passive-aggressive comments and insults.
Data science is a field of science, focusing mainly on applications. As a result, a data science professional is more concerned about the way she works with the data at hand, to turn it into something useful. More often than not, this involves some predictive analytics model too, which she has to train, test, and fine-tune. All of that she does without any concern about the OS used, since the programming language she works with is cross-platform. Also, if she can use that programming language well, she won’t have any issues with shell scripting in a particular OS, which is fairly simple by comparison.
Now, some OSes are faster than others, so someone may prefer one of them for that reason, even though the speed usually comes at a cost, namely the lack of user-friendliness that a faster OS tends to exhibit. Whatever the case, if the code is written well, it’s bound to be fast even on a slower OS. Also, if the code needs to run for a long time, it’s probably better off running on a computer cluster or on the cloud, so the OS you use on your own computer is not that important.
So, if you are comfortable with OS X and can do your data science work there efficiently, that’s all that matters. Let the people who have nothing else to do argue about which OS is better or worse for data science. If those people weren't arguing about this trivial matter, they’d probably be arguing about which soft drink is better, or which sports team should win the championship. As a data scientist you have better things to do than waste time talking with them, since it’s unlikely they will ever change their views anyway.
After investigating this topic quite a bit, as I was looking into A.I. matters, I decided to create a video on it. To make it more complete, I included other methods too, such as statistics-based and heuristics-based ones. Despite the large amount of content I put into this project (the script was over 4000 words), I managed to keep the video at a manageable length (a bit less than half an hour). Check it out on Safari when you have some time!
Distributed Ledger Technology (DLT) is a recent development in IT that enables forming consensus across a distance, within a reasonable amount of time. This decentralized approach to computing has a variety of real-world applications, such as crypto-currencies, automated contracts, etc. Most people think of Blockchain as the best DLT out there, but lately there has been a new player in this field called Hashgraph, which is quite promising. Let’s take a closer look at it.
Hashgraph is a new approach to DLT that promises to be better than Blockchain and its variants. Right now it is still in its early stages and to many people it may seem somewhat academic, but it’s quite mature as a technology, since the company that developed it already has industrial clients. Also, Dr. Leemon Baird, who came up with the idea and co-founded the company, has been working on it for a few years now, so it’s not some trendy idea that is bound to disappear in a few months. You can learn more about the specifics of its functionality in this comprehensive deck, which demonstrates how an extension of the gossip protocol is employed to greatly speed up voting among the nodes (in what is called “virtual voting”), driving consensus among them in an efficient and safe manner.
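To get a rough feel for the gossip part, here is a toy sketch in Julia of plain gossip dissemination. This illustrates only the basic spreading mechanism, not Hashgraph's actual gossip-about-gossip or virtual voting algorithm, and all the names in it are mine:

```julia
# Toy gossip round: each node syncs its known events with one random peer.
# In Hashgraph, nodes additionally gossip about the gossip itself (who told
# what to whom), which is what makes virtual voting possible without any
# extra voting messages on the network.
function gossip_round!(known::Vector{Set{Int}})
    n = length(known)
    for i in 1:n
        j = rand(setdiff(1:n, i))   # pick a random peer (not yourself)
        union!(known[i], known[j])  # exchange known events both ways
        union!(known[j], known[i])
    end
end

known = [Set([i]) for i in 1:8]     # each of 8 nodes starts with 1 event
rounds = 0
while any(s -> length(s) < 8, known)
    gossip_round!(known)
    rounds += 1
end
println("All events reached all nodes after $rounds rounds")
```

The point to notice is that every piece of information reaches all nodes in roughly O(log n) rounds, which is what makes gossip-based protocols so cheap in terms of communication.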
The key advantage of this tech is that it’s quite fast. Not two or three times faster than the best alternative, but orders of magnitude faster. The reason is that it has a lower computational cost, plus there is no need to wait for potential disagreements to resolve themselves (a problem in Blockchain, related to forking).
Another advantage is that hashgraph can guarantee fairness. That’s not a marketing point, but something mathematically proven, deriving from the fact that the knowledge of the system that reaches any particular node converges over time to the total information of that system. Fairness is linked to consensus and is essential for a network where the first-come-first-served rule can apply.
Also, this system is pretty secure, since it has what is known as Byzantine Fault Tolerance. This refers to the condition whereby all the members of the network know when they have reached consensus, with no room for doubt about it. This level of security ensures that certain network attacks, like Distributed Denial of Service (DDoS), cannot harm a hashgraph network. Also, this system doesn’t require proof-of-work, like many Blockchain-based systems do, which brings us to the next point.
Finally, hashgraph is lightweight, requiring very little power to run. This also makes it very scalable, as it can run on any kind of device (e.g. a smartphone), and sustainable (it requires a manageable amount of energy when it scales). This last point is not mentioned on the hashgraph site, but it’s a very important one, since it can make the difference between a system that is good in theory and one that is highly functional in practice.
Also not mentioned on the site are a couple of disadvantages of hashgraph. First of all, the whole technology is patented. This means that
1. not everyone can tweak its code (though there is an SDK available if you want to build an application on top of the hashgraph infrastructure), and
2. there is no guarantee that this tech is going to remain free in the future.
What’s more, hashgraph doesn’t have a crypto-currency yet (even though there are scammers who try to sell you hashgraph coins). This makes hashgraph seem less applicable to most people, who have already become familiar with alternative currencies like Bitcoin and Ripple. Also, a crypto-currency seems like an obvious application of a DLT, so not having one based on hashgraph raises suspicion about how serious it is about competing with Blockchain.
Overall, hashgraph seems like a very promising kind of DLT, perhaps one that can add a lot of value to the digital world. It has become clear that despite the various shortcomings of DLTs, the decentralization options they offer are something the world needs, even if it may take a few years before this mindset becomes mainstream. Hashgraph still has some issues and may not be able to win everyone over, but just like the several closed-source technologies out there that have their share of users, hashgraph may be able to earn a piece of the pie of cyber products and services, such as automated contracts, easy online collaboration without the need for servers, fast crypto-currencies, etc.
How Is This Relevant to Data Science / A.I.?
Since most breakthroughs come about from the combination of different technologies, it would be naive to think that data science or A.I. will evolve on their own. New technologies like DLTs, IoT, and cyber-security have a role to play, since they come with their own sets of data streams and data problems that need to be solved. Perhaps the next generation of data products will run as applications on a hashgraph instead of a cloud, with the latter used just to store the most sensitive data. Whatever the case, it’s good to look at the bigger picture of the tech world when contemplating data science and artificial intelligence, since the more the various tech fields evolve, the more inter-connected they become.
Recently I had a couple of very insightful conversations with some people, over drinks or coffee. We talked about A.I. systems and how they can pose a threat to society. The funny thing is that none of these people were A.I. experts, yet they had a very mature perspective on the topic. This led me to believe that if non-experts have such concerns about A.I., then perhaps it’s not as niche a topic as it seemed. BTW, the dangers they pinpointed had nothing to do with robots taking over the world through some Hollywood-like scenario, but were far more subtle, just like A.I. itself. Also, they are not about how A.I. can hurt us sometime in the future but about how its dangers have already started to manifest. So, I thought about this topic some more, going beyond the generic and quite vague warnings that some individuals have shared with the world over interviews. The main dangers I’ve identified through this quest are the following:
Interestingly, all of these have more to do with us, as people, rather than the adaptive code that powers these artificial mental processes we call A.I.
Over-reliance on A.I.
Let’s start with the most obvious pitfall: over-reliance on this new tech. In a way, this is already happening to some extent, since many of us use A.I. without even realizing it and have come to depend on it. Pretty much every system on a smartphone that makes the device “smart” is something to watch out for. From virtual assistants to adaptive home screens to social chatbots, these are A.I. systems that we may get so used to that we won’t be able to do without them. Personally I don’t use any of these, but as the various operating systems evolve, they may not leave users a choice when it comes to the use of A.I. in them.
Degradation of Soft Skills
Soft skills may be something many people talk about, and even more have come to value, especially in the workplace. However, with A.I. becoming more and more of a smooth interface for us (e.g. with customer service bots), we may not be as motivated to cultivate these skills. This inevitably leads to their degradation, along with the atrophy of related mental faculties, such as creativity and intuition. After all, if an A.I. can provide us with viable solutions to problems, why would we feel the need to think outside the box to find them ourselves? And if an A.I. can make connecting with others online very easy, why would someone opt for face-to-face connections instead (unless their job dictates it)?
Bugs in Automated Processes
Automated processes may seem enticing through the abstraction they offer, but they are far from perfect. Even the most refined A.I. system may have some hidden issues under the hood, among its numerous hidden layers. Just because it can automate a process doesn't mean that there are no hidden biases in its functionality, or that it won't draw some noticeably wrong conclusions from time to time. This is natural, since every system is bound to fail at times. The problem is that if an A.I. system fails, we may not be able to correct it, while in some cases even perceiving its bug may be a hard task, let alone proving it to others.
Lack of Direct Experience of the World (VR and AR)
This is probably a bit futuristic, since if you live in a city outside the tech bubble (e.g. outside the West Coast of the US), there are still plenty of opportunities for direct experience. However, as technologies like virtual reality (VR) and augmented reality (AR) become cheaper and more commercially viable, they are bound to become the go-to interface for the world, e.g. through “tourism” apps or virtual “museums.” Although these technologies would be useful, particularly for people without easy access to the rest of the world, there is no doubt that they are bound to be abused, resulting in some serious social problems and bringing about further societal fragmentation.
Blind Faith in A.I. Tech
This is probably the worst danger of A.I., which may seem similar to the first one mentioned, though it is more subtle and more sinister. The idea is that some people become very passionate about the merits of A.I. and quite defensive about their views. Their stance on the matter is eerily similar to that of some religious zealots, though the “prophets” of these A.I. movements may seem level-headed and detached. However, even they often fail to hide their borderline obsession with their ideology, whereby A.I. is deified. It’s one thing to speculate about a future society where A.I. may have an administrative role in managing resources, and a completely different thing to believe that A.I. will enter our lives and solve all our problems, like some nurturing alien god of sorts.
An Intelligent Approach to All This
Not all is doom and gloom, however. Identifying the dangers of A.I. is a good first step towards dealing with them. An intelligent way to do that is to first take responsibility for the whole matter. It’s not A.I.’s fault that these dangers come about. Just like every technology we've developed, A.I. can be used in different ways. If cars cause thousands of people to die every year, it’s not the cars’ fault. Also, just like the car was built to enrich our lives, A.I.’s development has similar motives. So, if we see it as an auxiliary technology that can help us make certain processes more efficient, rather than a panacea, we have a good chance of co-existing with it without risking our individual and social integrity.
Although it's been over two weeks since I finished working on the Data Visualization video and about a month since I completed the Deep Learning one, both of them just became available on Safari (a subscription-based platform for various educational material). So, if you are up for some food for thought on DL and DV, check them out when you have a moment: Deep Learning vid and Data Visualization vid.
Note that these are both overview videos and although in the Data Viz one I include several references to libraries in Python and Julia for creating various plots, the videos are fairly high-level. These are not in-depth tutorials on the topics.
Once I decide to take a break from all the book-writing these days, I'll probably make another video either on AI or on a more conventional DS topic. So, stay tuned...
When it comes to DS education, nowadays a lot of emphasis is given to one of two things: the math aspect of it, and the complex algorithms of deep learning systems. Although all this is essential, particularly if you want to be a future-proof data science professional, there is much more to the field than that. Namely, the engineering mentality is something you need to cultivate, since at its core, data science is an engineering discipline. I don’t mean that in a software engineering sense, but as a practicality- and efficiency-oriented approach to building a system.
This is largely due to the scaling dimension of a data science metric or model. Unfortunately, most data science “educators” fail to elaborate on this point, since they focus mainly on parroting other people’s work, instead of prompting students to gain a deeper understanding of the methods and processes being taught. Also, how something scales is the filter that distinguishes a robust algorithm from a mediocre one. As we obtain more and more data, having an algorithm that works well only on a small dataset (or one that requires a great deal of parallelization to yield any benefits) is not sustainable. Of course some people are happy with that, since they have a great deal of resources available, which they are happy to rent out. However, we can often obtain good enough results with fewer resources, through algorithms that scale better. Even if most people don’t share this fox-like approach to data science, that doesn’t make it any less relevant. After all, many people associate methods with the frameworks particular companies offer, rather than understanding the science behind these methods.
Scaling a method up intelligently is the product of three things:
1. having a deep understanding of a method
2. not relying on an abundance of resources to scale it up
3. being creative about the method, making compromises where necessary, to make it more lightweight
That’s where the engineering mentality comes in. The engineer understands the math, but isn’t concerned with having the perfect solution to a problem. Instead, he cares about having a good enough solution that is reliable and not too costly.
This kind of thinking is what drives the development of modern optimization systems, which are an important part of AI. Artificial Intelligence may involve things like deep learning networks, but there is more to it than that. So, if you want to delve more into this field and its numerous applications in data science, cultivating this engineering mentality is the optimal way to go. Perhaps not the absolute best one, but definitely one that works well and is efficient enough!
Both in the DS Modeling Tutorial and in another article of mine, I've mentioned the importance of discretizing / binning a continuous variable as a strategy for turning it into a feature to be used in a data model. However, how meaningful and information-rich the resulting categorical feature is going to be depends on the thresholds we use. In this post I'd like to share a strategy I've come up with that works well for picking those thresholds.
First of all, we need to make sure we have a potent method for calculating the density around a data point. I'm not talking about probability density though, since the latter is a statistical concept that has more to do with the mathematical form of a distribution than with the density actually observed. The actual density is what we would measure if we were to look at the data itself and, although it's quite straight-forward, it's not as easy to do at scale. That's why I first developed a very simple (almost simplistic) method for approximating density using a sampling of sorts, rather than looking at each individual element in the variable.
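To make this concrete, here is a minimal sketch in Julia of such a sampling-based density estimate. The post doesn't spell out the exact method, so the window-based approach and all the names below are my own assumptions:

```julia
# Approximate the empirical density at point x as the fraction of a random
# sample of the data that falls within a small window around x.
function approx_density(x::Real, data::AbstractVector{<:Real};
                        nsample::Int = 1000, width::Real = 0.05)
    idx = rand(1:length(data), min(nsample, length(data)))  # sample with replacement
    h = width * (maximum(data) - minimum(data))             # window half-width
    count(z -> abs(z - x) <= h, data[idx]) / length(idx)    # fraction near x
end

approx_density(0.0, randn(100_000))   # density near the mode of a Gaussian
```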
Afterwards, we just need to figure out the point of least density that's not an extreme of the variable. In other words, we identify a local minimum in the density distribution, a fairly easy task that's also computationally cheap. Of course, it's good to have a threshold too, to distinguish between this point being an actual low-density point and one that could be due to chance. If the density of that point is below this threshold, we can take it to be a point of dissection for the variable, effectively binarizing it.
Beyond that, we can repeat the same process recursively for the two partitions of the variable. This way, we can end up with 3, 4, or even 100 partitions at the end of the process. This is another reason why the aforementioned threshold is very important. After all, not all partitions would be binarizable in a meaningful way. Also, it would be a good idea to have a limit on how many partitions we allow overall, so that we don't end up with a categorical variable having 1000 unique values either!
This optimal discretization / binning process is very simple and robust, resulting in a simpler form of the original variable, one that can be broken down into a set of binary features afterwards, if needed. This can also be useful for identifying potential outliers and being able to use them (as separate values in the new feature) instead of discarding them. The method is made even faster through its implementation in Julia, which once again proved itself a great DS tool.
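Putting it all together, here is a minimal sketch of how the recursive splitting could look, reusing the approx_density function sketched earlier. The threshold value, grid resolution, and names are my own assumptions, not the actual implementation:

```julia
# Recursively split a variable at its lowest-density interior point, as long
# as that density is below a threshold and the partition limit isn't hit.
# Returns the sorted cut points (thresholds) for the binning.
function find_cuts(data::AbstractVector{<:Real}; thres = 0.01, maxparts = 10)
    cuts = Float64[]
    function split!(seg)
        length(cuts) + 1 >= maxparts && return            # partition limit reached
        lo, hi = minimum(seg), maximum(seg)
        grid = [lo + k * (hi - lo) / 50 for k in 1:49]    # interior points only
        dens = [approx_density(x, seg) for x in grid]
        d, i = findmin(dens)
        d >= thres && return                              # could just be chance
        push!(cuts, grid[i])
        split!(filter(z -> z <= grid[i], seg))            # recurse on both parts
        split!(filter(z -> z > grid[i], seg))
    end
    split!(data)
    sort(cuts)
end
```

Each cut then corresponds to a threshold of the resulting categorical feature, and an isolated low-density tail naturally ends up in its own bin, which is how the outlier-friendly behavior mentioned above comes about.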
Recently I've started working on a new book (with Technics Publications, like all my other books). As a result, I won't be able to write articles as often in the months to come, since I'll be focusing on that. However, I may create some new vids before the book is finished, something I estimate will happen sometime in June. If you stay tuned to this site, you'll be among the first to know!
Thank you all for your support through the purchasing of my publications, as well as through the watching of my videos on Safari. Ciao!
JuliaRun is Julia’s latest cloud-based version. In my book, Julia for Data Science, I mentioned that there is an online version of the language, called JuliaBox. That version uses Jupyter as its front-end and runs on the cloud. JuliaRun is the next version of JuliaBox, still using Jupyter, but also offering various scalability options. JuliaRun is powered by the Microsoft cloud, aka Azure, though there is an option of running it on your own cluster (ask the Julia Computing people for details).
Signing in to JuliaRun is a fairly simple process: you just need to use either your GitHub credentials or your Google account. It’s not clear why someone has to be tied to an external party instead of having a Julia Computing account, but since creating a Google account is free, it's not a big issue! Also, it is a bit peculiar that JuliaRun doesn’t support Microsoft credentials, but then again, an MS account is not as popular as these other two sign-in options.
After you sign in, you need to accept the Terms of Service, a fairly straight-forward document, considering that it is a legal one. The most useful take-away from it is that if you leave your account inactive for about 4 months, it’s gone, so this is not for people who are not committed to using it.
Once you accept the ToS, you are taken to an IJulia directory on Jupyter. This is where all your code notebooks are stored. The file system has a few things there already, the most noteworthy of which are a few tutorials. These are very helpful for getting started and also demonstrate how Julia works on this platform. If you’ve never used IJulia before, there is also a good guide for that. Note that IJulia can run on Jupyter natively too, once you install the IJulia package and the Jupyter platform on your machine.
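For reference, running IJulia locally boils down to something like this (using the Julia 0.6-era package manager commands that this post's timeframe implies):

```julia
Pkg.add("IJulia")   # installs the IJulia kernel (and Jupyter itself via Conda, if missing)
using IJulia
notebook()          # launches the Jupyter notebook interface in your browser
```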
Kernel and Functionality
The Julia version used on JuliaRun is the latest stable release, which at the time of this writing is 0.6. However, the kernel version may differ for certain notebooks (e.g. for the JuliaTutorial one, it’s 0.5.2). Still, the differences between the last couple of versions are minor, for the most part. I’d recommend you go through the tutorials and also create some test notebooks of your own before starting on a serious project, unless of course you already use IJulia on your computer.
Adding packages is fairly straight-forward, though it can be a time-consuming process, especially if you have a lot of packages to install. Also, you have the option of installing a package in either one of the two latest versions of the language, or both, if you prefer. If you are more adventurous, you can even install an unregistered package, by providing the corresponding URL.
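In code terms, these operations map to the usual Julia 0.6-era package commands (the URL below is just a placeholder):

```julia
Pkg.add("DataFrames")                                # a registered package
Pkg.clone("https://github.com/user/SomePkg.jl.git")  # an unregistered one, via its URL
```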
You can also add code to JuliaRun through a Git repository (not necessarily GitHub). You just need to specify the URL of the repository, the branch, and which folder on JuliaBox you want to clone it in.
JuliaRun also offers a brief but useful help option. It mainly consists of a few FAQs, as well as an email address for more specialized questions. This is probably better than the long help pages of some other platforms, which are next to impossible to navigate and are written by people who are terrible at writing. The help on this platform is brief, but comprehensive and with the user in mind.
For those who are closer to the metal and prefer the direct interaction with the Julia kernel, rather than the IJulia notebook interface, there is also the option to start a terminal. You can access that via the New button at the directory page.
From what I’ve seen of JuliaRun, both through a demo from the Julia team and through my own experience, it is fairly easy to use. What I found very useful is that it doesn’t require any low-level data engineering expertise, though if you are good at working a cloud platform through the terminal or via Jupyter, that’s definitely useful. However, if you are someone more geared towards the high-level aspects of the craft, you can still do what you need to do without spending too much time on configuration.
I’d love to write more about this great platform that takes Julia to the next level, but this post is already too long. So, whenever you have a chance, give it a try and draw your own conclusions about this immensely useful tool.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.