New Book in the Works...
Recently I've started working on a new book (with Technics Publications, as with all my other books). As a result, I will not be able to write articles as often in the months to come, since I'll be focusing on that. However, I may create new videos before the book is finished, which I estimate will be sometime in June. If you stay tuned on this site, you'll be among the first to know!
Thank you all for your support through the purchasing of my publications, as well as through the watching of my videos on Safari. Ciao!
JuliaRun is Julia’s latest cloud-based version. In my book, Julia for Data Science, I’ve mentioned that there is an online version of the language, called JuliaBox. This version uses Jupyter as its front-end and runs on the cloud. JuliaRun is the next version of JuliaBox, still using Jupyter, but also offering various scalability options. JuliaRun is powered by the Microsoft cloud, aka Azure. However, there is an option of running it on your own cluster (ask the Julia Computing people for details).
Signing in to JuliaRun is a fairly simple process. You just need to use either your GitHub credentials or your Google account. It’s not clear why someone has to be tied to an external party instead of having a Julia Computing account, but since creating a Google account is free, it's not a big issue! Also, it is a bit peculiar that JuliaRun doesn’t support Microsoft credentials, but then again, an MS account is not as popular as these other two sign-in options.
After you sign in, you need to accept the Terms of Service, a fairly straightforward document, considering that it is a legal one. The most useful takeaway from it is that if you leave your account inactive for about 4 months, it’s gone, so this is not for people who are not committed to using it.
Once you accept the ToS, you are taken to an IJulia directory on Jupyter. This is where all your code notebooks are stored. The file system has a few things there already, the most noteworthy of which are a few tutorials. These are very helpful to get you started and also to demonstrate how Julia works on this platform. If you’ve never used IJulia before, there is also a good guide for that. Note that IJulia can run on Jupyter natively too, once you install the IJulia package and the Jupyter platform on your machine.
Kernel and Functionality
The Julia version being used on JuliaRun is the latest stable release, which at the time of this writing is 0.6. However, the kernel version may differ for certain notebooks (e.g. for the JuliaTutorial one, it’s 0.5.2). Still, the differences between the last couple of versions are minute, for the most part. I’d recommend you go through the tutorials and also create some of your own test notebooks, before starting on a serious project, unless of course you use IJulia already on your computer.
Adding packages is fairly straightforward, though it can be a time-consuming process, especially if you have a lot of packages to install. Also, you have the option of installing a package in either one of the two latest versions of the language, or both, if you prefer. If you are more adventurous, you can even install an unregistered package, by providing the corresponding URL.
You can also add code to JuliaRun through a Git repository (not necessarily GitHub). You just need to specify the URL of the repository, the branch, and which folder on JuliaBox you want to clone it in.
JuliaRun also offers a brief, but useful, help option. It mainly consists of a few FAQs, as well as an email address for more specialized questions. This is probably better than the long help pages in some other platforms that are next to impossible to navigate and are written by people who are terrible at writing. The help on this platform is brief, but comprehensive and with the user in mind.
For those who are closer to the metal and prefer direct interaction with the Julia kernel, rather than the IJulia notebook interface, there is also the option to start a terminal. You can access that via the New button on the directory page.
From what I’ve seen of JuliaRun, both through a demo from the Julia team, and through my own experience, it is fairly easy to use. What I found very useful is that it doesn’t require any low-level data engineering expertise, though if you are good at working the processes of a cloud platform through the terminal or via Jupyter, that’s definitely useful. However, if you are someone more geared towards the high-level aspects of the craft, you can still do what you need to do, without spending too much time on the configurations.
I’d love to write more about this great platform that takes Julia to the next level, but this post is already too long. So, whenever you have a chance, give it a try and draw your own conclusions about this immensely useful tool.
Why Following Someone on the Social Media Is Not an Effective Learning Strategy for Data Science
Ever since social media (SM) became a mainstream option for spending one’s time on the web, it has started to disrupt the way we view information and even knowledge to some extent. Even though there is no doubt that SM offer substantial benefits in advertising and branding, there is little they can offer when it comes to actually learning something. Here is why.
Even though some articles can be thought-provoking, consuming information to satisfy your curiosity and actually assimilating it are two different things. This is particularly true when it comes to a technical field, like data science, where being informed about something is barely enough to have an opinion on the topic, let alone do something useful with it. Many people who roam the SM in search of mentors don’t realize that. They tend to forget that following someone in an attempt to learn from them is the equivalent of body-building by just hanging out in the lobby of a gym. Yet, they do it anyway because it’s easy and it doesn’t cost them anything (other than some time, assuming that they read the stuff their leaders post on the SM).
If you really want to learn something, especially something complex and multifaceted like data science, you need to get your hands dirty and you have to break a sweat. The various things someone posts on the SM aren’t going to help much. There is a reason why books and videos on the subject sell, even if there is abundant information on the web. Also, in my experience, if a platform doesn’t charge you for the “products” it offers to you, that’s because you are the product! SM are designed with that in mind. Of course, some of them may be worth the time you spend on them since they can be a source of a diverse array of views on a topic (hopefully from different perspectives), but that’s not the same as applicable knowledge. If you want to hone your data science skills you need something you can rely on, not something someone types on the SM while enjoying their morning coffee, to pass the time.
So, what can you do, instead of following someone on the SM? There are various strategies, each with its own set of benefits. Ideally, you would do a combination of them to maximize your learning opportunities. The main strategies are:
What are your thoughts on the matter? How do you learn data science?
For the past few months I've been working on a tutorial on the data modeling part of the data science process. Recently I've finished it and as of 2 weeks ago, it is available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested in not just the technical aspects but also the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
The idea of sampling is fundamental in data science and even though it is taught in every data science book or course out there, there is still a lot to be learned about it. Sampling is a very deep topic and just like every other data-related topic out there, conventional Statistics fails to do it justice. The reason is simple: good quality samples come about by obtaining an unbiased representation of a population and this is rarely the case with strictly random samples. Also, the fact that Statistics doesn’t offer any metric whatsoever regarding bias in a sample doesn’t help the situation.
The Index of Bias (IB) of a Sample
There are two distinct aspects of a sample: its bias and its diversity. Here we’ll explore the former, as it is expressed in the two fundamental aspects of a distribution: its central point and its spread. For these aspects we’ll use two robust and fairly stable metrics, the median and the inter-quartile range, respectively. The deviation of a sample in terms of these metrics, with each deviation normalized based on the maximum deviation possible for the given data, yields two metrics, one for the central point and one for the spread. Each metric takes values between 0 and 1, inclusive. The average of these metrics is defined as the index of bias of a sample and takes values in the [0, 1] interval too. Note that the index of bias is always in relation to the original dataset we take the sample from.
Although the above definition applies to one-dimensional data only, it can be generalized to multi-dimensional data too. For example, we can define the index of bias of a dataset comprising d dimensions (features) as the arithmetic mean of the index of bias of each one of its features.
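To make the definition more concrete, here is a minimal Python sketch of it. The post does not spell out how the "maximum deviation possible" is computed, so the normalizers below are an assumption: the farthest a sample's median can lie from the dataset's median (given the dataset's min and max), and the farthest a sample's IQR can lie from the dataset's IQR (given that any IQR falls between 0 and the dataset's range). The function names are made up for illustration, and each sample needs at least two points.

```python
import statistics


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of a sequence of at least two numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def index_of_bias(sample, population):
    """Index of bias of a 1-D sample relative to the dataset it was drawn from."""
    lo, hi = min(population), max(population)
    if hi == lo:
        return 0.0  # constant data: no sample can deviate from it
    pop_median = statistics.median(population)
    pop_iqr = iqr(population)
    # ASSUMPTION: "maximum deviation possible" is taken to be the largest gap
    # a sample statistic could have from the dataset's own statistic.
    max_median_dev = max(pop_median - lo, hi - pop_median)
    max_iqr_dev = max(pop_iqr, (hi - lo) - pop_iqr)
    median_bias = abs(statistics.median(sample) - pop_median) / max_median_dev
    spread_bias = abs(iqr(sample) - pop_iqr) / max_iqr_dev
    return (median_bias + spread_bias) / 2


def index_of_bias_nd(sample_rows, population_rows):
    """Arithmetic mean of the per-feature index of bias (rows are tuples)."""
    sample_cols = list(zip(*sample_rows))
    population_cols = list(zip(*population_rows))
    scores = [index_of_bias(s, p) for s, p in zip(sample_cols, population_cols)]
    return sum(scores) / len(scores)


population = list(range(101))                    # values 0..100
print(index_of_bias(population, population))     # the full dataset vs itself
print(index_of_bias([0, 1, 2, 3, 4], population))  # a sample from one tail
```

With this choice of normalizers, a sample identical to the dataset scores 0, while a sample squeezed into one tail scores close to 1. Other normalizers are possible and would change the scores, though not the [0, 1] range.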
IB Scores for Various Samples
Strictly random samples tend to have a fairly high IB score, considering that we expect them to be unbiased. That’s not to say that they are always very biased, but they are definitely in need of improvement. Naturally, if the data we are sampling is multi-dimensional, the chances of bias are higher, resulting in an overall biased sample.
Samples that are engineered with IB in mind are more likely to be unbiased in that sense. Naturally, this takes a bit of effort. Still, given enough random samples, it is possible to get a good enough sample that is unbiased based on this metric. In the attached file I include the IB scores of various samples, for both a random sampling process (first column) and a more meticulous one that aims to obtain a less biased sample (second column). Note that the latter did not use the IB metric in its algorithm, though a variant of it that makes use of that metric is also available (not free though). Also, you don’t need to be an expert in statistical tests to see that the second sampling method is consistently better than the first one. Finally, I did other tests on different data, and in every case, the results were very similar.
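The remark that, given enough random samples, one of them will be good enough can be sketched in a few lines of Python. This is not the sampling method used for the attached results (that one did not use IB at all); it is just a hypothetical best-of-k selection, with made-up function names, and it assumes the same normalizers for IB as described earlier (largest possible gap in median and in IQR, given the data's min and max).

```python
import random
import statistics


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of a sequence of at least two numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def index_of_bias(sample, population):
    """IB of a 1-D sample; the normalizers are an assumption, see lead-in."""
    lo, hi = min(population), max(population)
    if hi == lo:
        return 0.0
    med, spread = statistics.median(population), iqr(population)
    median_bias = abs(statistics.median(sample) - med) / max(med - lo, hi - med)
    spread_bias = abs(iqr(sample) - spread) / max(spread, (hi - lo) - spread)
    return (median_bias + spread_bias) / 2


def best_of_k_sample(population, size, k, rng):
    """Draw k simple random samples and keep the one with the lowest IB."""
    candidates = [rng.sample(population, size) for _ in range(k)]
    return min(candidates, key=lambda s: index_of_bias(s, population))


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # skewed toy data
single = rng.sample(population, 50)
best = best_of_k_sample(population, 50, 200, rng)
print("single random sample IB:", index_of_bias(single, population))
print("best of 200 samples IB: ", index_of_bias(best, population))
```

The cost grows linearly with k, so this brute-force approach is only practical for small samples or cheap metrics, which is one reason a purpose-built sampling method is worth the effort.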
Hopefully, this small experiment goes to show that sampling is not as trivial a problem as it is made out to be by those who follow some old-fashioned paradigm for data analytics. Despite its simplicity, sampling has a lot of facets that need to be explored and understood, before it can be leveraged as a data science technique. After all, what good is an advanced model if the data it is trained on is biased? I believe we owe it to ourselves as data scientists to pay attention to every part of the data science process, including the less interesting parts, such as sampling.
Nowadays, more than ever before, there are a bunch of experts in the data science field, telling everyone what to think and what’s important. This, although useful to some extent, may be a hindrance after you reach a certain level of expertise. That’s not to say that experts’ views are useless, but it’s always good to take them with a pinch of salt.
Experts are people who have learned the field in such depth that they can think of it as people who speak a foreign language can think in terms of that language’s vocabulary and logical structures (e.g. grammar and syntax). An expert in our field doesn't see data science as something outside himself, but rather as a part of him, much like his ability to read and write. This level of intimacy with the know-how in data science enables him to perceive things that most people cannot, and offer deeper insights about the ins and outs of data science.
However, experts don’t know everything and it’s very easy for someone to become so enticed by his expertise that the boundaries of his understanding become blurred. This is a very dangerous thing, since the expert may have the false impression that he knows everything there is to know and/or that everything he knows is valid. Data science is a very dynamic field, though, so even if you attain expertise in it, things change and some adaptation is in order. Some experts forget that.
Even if experts have a lot to teach us, we need to always be aware that there are things they do not know, or that they do not know well enough. For example, many experts are very knowledgeable about traditional statistics and whatever lies beyond that part of data science is secondary for them. Yet, even in the field of statistics they only know what they have learned and may lack the curiosity to explore different kinds of Stats, or the humility to acknowledge their existence. Experts like that will tell you that data science is all about statistics, reiterating the stuff they have learned. However, if you try to pinpoint the limitations of what they know, they will label you as a heretic, which is why most people don’t say anything back to them. This is dangerous though, since silence can strengthen their already inflated view of their authority, and bring about even stronger views in them.
That’s why the best approach is to try things out yourself. An expert makes a claim about a certain topic in data science; instead of taking it as fact, put it to the test to see if it holds water. If it’s something that’s public knowledge, cross-reference it. If it’s something that can be verified or disproved through experimentation, write a script around it. Whatever the case, don’t take things for granted, just because some expert says so.
All this is related to developing the right mindset for data science, which is all about asking questions and trying to answer them in a methodical manner (aka the scientific method), using a variety of data analytics methods and lots of programming. Techniques and tools become obsolete sooner or later, but this mindset I’m referring to is always relevant…
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.