When it comes to DS education, there is nowadays a lot of emphasis placed on one of two things: the math aspect of it, and the complex algorithms of deep learning systems. Although all this is essential, particularly if you want to be a future-proof data science professional, there is much more to the field than that. Namely, the engineering mentality is something that you need to cultivate, since at its core, data science is an engineering discipline. I don’t mean that in a software sense, but more as a practicality- and efficiency-oriented approach to building a system.
This is largely due to the scaling dimension of a data science metric or model. Unfortunately, most data science “educators” fail to elaborate on this point, since they focus mainly on parroting other people’s work, instead of inciting students to gain a deeper understanding of the methods and processes being taught. Also, scaling is the filter that distinguishes a robust algorithm from a mediocre one. As we obtain more and more data, having an algorithm that works well only on a small dataset (or one that requires a great deal of parallelization to yield any benefits) is not sustainable. Of course some people are happy with that, since they have a great deal of resources available, which they are happy to rent out. However, we can often obtain good enough results with fewer resources, through algorithms that scale better. Even if most people don’t share this fox-like approach to data science, it doesn’t make it less relevant. After all, many people associate methods with the frameworks particular companies offer, rather than understand the science behind these methods.
Scaling a method up intelligently is the product of three things:
1. having a deep understanding of a method
2. not relying on an abundance of resources to scale it up
3. being creative about the method, making compromises where necessary, to make it more lightweight
That’s where the engineering mentality comes in. The engineer understands the math, but isn’t concerned about having the perfect solution to a problem. Instead, he cares about having a good enough solution that is reliable and not too costly.
This kind of thinking is what drives the development of modern optimization systems, which are an important part of AI. Artificial Intelligence may involve things like deep learning networks, but there is more to it than that. So, if you want to delve more into this field and its numerous applications in data science, cultivating this engineering mentality is the optimal way to go. Perhaps not the absolute best one, but definitely one that works well and is efficient enough!
I've mentioned both in the DS Modeling Tutorial and in another article of mine the importance of discretization / binning of a continuous variable, as a strategy for turning it into a feature, to be used in a data model. However, how meaningful and information-rich the resulting categorical feature is going to be depends on the thresholds we use. In this post I'd like to share with you a strategy that I've come up with that works well in doing just that.
First of all, we need to make sure we have a potent method for calculating the density of a data point. I'm not talking about probability density though, since the latter is a statistical concept that has more to do with the mathematical form of a distribution than the actual density observed. The actual density is what we would measure if we were to look at the data itself and although it's quite straight-forward, it's not as easy to do at scale. That's why I first developed a very simple (almost simplistic) method for approximating density using a sampling of sorts, rather than looking at each individual element in the variable.
Afterwards, we just need to figure out the point of least density that's not an extreme of the variable. In other words, we identify a local minimum in the density distribution, a fairly easy task that's also computationally cheap. Of course it's good to have a threshold too, to distinguish between this point being an actual low-density point and one that could be due to chance. If the density of that point is below this threshold, we can take it to be a point of dissection for the variable, effectively binarizing it.
Beyond that, we can repeat the same process recursively, for the two partitions of the variable. This way, we can end up with 3, 4, or even 100 partitions at the end of the process. This is another reason why this aforementioned threshold is very important. After all, not all partitions would be binarizable in a meaningful way. Also, it would be a good idea to have a limit to how many partitions overall we allow, so that we don't end up with a categorical variable having 1000 unique values either!
This optimal discretization / binning process is very simple and robust, resulting in a simpler form of the original variable, one that can be broken down into a set of binary features afterwards, if needed. This can also be useful in identifying potential outliers and being able to use them (as separate values in the new feature) instead of discarding them. The method is made even faster through its implementation in Julia, which once again proved itself as a great DS tool.
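To make the steps above more concrete, here is a rough sketch of the idea in Python. It is not the Julia implementation mentioned above, and several details are assumptions made purely for illustration: the function names, the use of equal-width slots over a random sample as the density approximation, a threshold expressed relative to the average slot density (rel_threshold), and the min_size and max_bins limits.

```python
import bisect
import random


def approx_density(values, lo, hi, n_slots=20, sample_size=1000, seed=0):
    """Share of (sampled) points falling in each of n_slots equal-width slots."""
    rng = random.Random(seed)
    # Work on a random sample when the variable is large (assumed approximation).
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    width = (hi - lo) / n_slots
    counts = [0] * n_slots
    for v in sample:
        slot = min(int((v - lo) / width), n_slots - 1)  # clamp the case v == hi
        counts[slot] += 1
    return [c / len(sample) for c in counts], width


def find_cuts(values, max_bins=10, rel_threshold=0.25, min_size=30, n_slots=20):
    """Recursively split a variable at interior points of very low density."""
    cuts = []

    def split(part):
        # Stop when the partition limit is reached or the part is too small.
        if len(cuts) + 1 >= max_bins or len(part) < min_size:
            return
        lo, hi = min(part), max(part)
        if lo == hi:
            return
        dens, width = approx_density(part, lo, hi, n_slots)
        # Least dense slot, excluding the two extremes of the variable.
        i = min(range(1, n_slots - 1), key=lambda k: dens[k])
        # Accept it only if it is well below the average slot density (1 / n_slots).
        if dens[i] >= rel_threshold / n_slots:
            return
        cut = lo + (i + 0.5) * width  # middle of the sparse slot
        cuts.append(cut)
        split([v for v in part if v < cut])
        split([v for v in part if v >= cut])

    split(list(values))
    return sorted(cuts)


def to_bins(values, cuts):
    """Map each value to the index of the partition it falls in."""
    return [bisect.bisect_right(cuts, v) for v in values]
```

Calling find_cuts on a variable returns the dissection points, and to_bins turns the variable into the corresponding categorical feature. With these particular defaults, a variable made of two well-separated clusters would be expected to get a cut somewhere in the gap between them, while the max_bins limit keeps the recursion from producing an unwieldy number of categories.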
Recently I've started working on a new book (with Technics Publications, as with all my other books). As a result, I will not be able to write articles as often in the months to come, since I'll be focusing on that. However, I may create new vids before the book is finished, which I estimate will happen sometime in June. If you stay tuned on this site, you'll be among the first ones to know!
Thank you all for your support through the purchasing of my publications, as well as through the watching of my videos on Safari. Ciao!
JuliaRun is Julia’s latest cloud-based version. In my book, Julia for Data Science, I’ve mentioned that there is an online version of the language, called JuliaBox. This version uses Jupyter as its front-end and runs on the cloud. JuliaRun is the next version of JuliaBox, still using Jupyter, but also offering various scalability options. JuliaRun is powered by the Microsoft cloud, aka Azure. However, there is an option of running it on your own cluster (ask the Julia Computing people for details).
Signing in to JuliaRun is a fairly simple process. You just need to use either your GitHub credentials or your Google account. It’s not clear why someone has to be tied to an external party instead of having a Julia Computing account, but since creating a Google account is free, it's not a big issue! Also, it is a bit peculiar that JuliaRun doesn’t support Microsoft credentials, but then again, an MS account is not as popular as these other two sign-in options.
After you sign in, you need to accept the Terms of Service, a fairly straight-forward document, considering that it is a legal one. The most useful take-away from it is that if you leave your account inactive for about 4 months, it’s gone, so this is not for people who are not committed to using it.
Once you accept the ToS, you are taken to an IJulia directory, on Jupyter. This is where all your code notebooks are stored. The file system has a few things there already, the most noteworthy of which are a few tutorials. These are very helpful to get you started and also to demonstrate how Julia works in this platform. If you’ve never used IJulia before, there is also a good guide for that. Note that IJulia can run on Jupyter natively too, once you install the IJulia package and the Jupyter platform on your machine.
Kernel and Functionality
The Julia version being used on JuliaRun is the latest stable release, which at the time of this writing is 0.6. However, the kernel version may differ for certain notebooks (e.g. for the JuliaTutorial one, it’s 0.5.2). Still, the differences between the last couple of versions are minute, for the most part. I’d recommend you go through the tutorials and also create some of your own test notebooks, before starting on a serious project, unless of course you use IJulia already on your computer.
Adding packages is fairly straightforward, though it can be a time-consuming process, especially if you have a lot of packages to install. Also, you have the option of installing a package in either one of the two latest versions of the language, or both, if you prefer. If you are more adventurous, you can even install an unregistered package, by providing the corresponding URL.
You can also add code to JuliaRun through a Git repository (not necessarily GitHub). You just need to specify the URL of the repository, the branch, and which folder on JuliaBox you want to clone it in.
JuliaRun also offers a brief, but useful, help option. It mainly consists of a few FAQs, as well as an email address for more specialized questions. This is probably better than the long help pages in some other platforms that are next to impossible to navigate and are written by people who are terrible at writing. The help on this platform is brief, but comprehensive and with the user in mind.
For those who are closer to the metal and prefer the direct interaction with the Julia kernel, rather than the IJulia notebook interface, there is also the option to start a terminal. You can access that via the New button at the directory page.
From what I’ve seen of JuliaRun, both through a demo from the Julia team, and through my own experience, it is fairly easy to use. What I found very useful is that it doesn’t require any low-level data engineering expertise, though if you are good at working the processes of a cloud platform through the terminal or via Jupyter, that’s definitely useful. However, if you are someone more geared towards the high-level aspects of the craft, you can still do what you need to do, without spending too much time on the configurations.
I’d love to write more about this great platform that takes Julia to the next level, but this post is already too long. So, whenever you have a chance, give it a try and draw your own conclusions about this immensely useful tool.
Ever since social media (SM) became a mainstream option for spending one’s time on the web, it has started to disrupt the way we view information and even knowledge to some extent. Even though there is no doubt that SM offer substantial benefits in advertising and branding, there is little they can offer when it comes to actually learning something. Here is why.
Even though some articles can be thought-provoking, consuming information to satisfy your curiosity and actually assimilating it are two different things. This is particularly true when it comes to a technical field, like data science, where being informed about something is barely enough to have an opinion on the topic, let alone do something useful with it. Many people who roam the SM in search of mentors don’t realize that. They tend to forget that following someone in an attempt to learn from them is the equivalent of body-building by just hanging out at the lobby of a gym. Yet, they do it anyway because it’s easy and it doesn’t cost them anything (other than some time, assuming that they read the stuff their leaders post on the SM).
If you really want to learn something, especially something complex and multifaceted like data science, you need to get your hands dirty and you have to break a sweat. The various things someone posts on the SM aren’t going to help much. There is a reason why books and videos on the subject sell, even if there is abundant information on the web. Also, in my experience, if a platform doesn’t charge you for the “products” it offers to you, that’s because you are the product! SM are designed with that in mind. Of course, some of them may be worth the time you spend on them since they can be a source of a diverse array of views on a topic (hopefully from different perspectives), but that’s not the same as applicable knowledge. If you want to hone your data science skills you need something you can rely on, not something someone types on the SM while enjoying their morning coffee, to pass the time.
So, what can you do, instead of following someone on the SM? There are various strategies, each with its own sets of benefits. Ideally, you would do a combination of them to maximize your learning opportunities. The main ones of these strategies are:
What are your thoughts on the matter? How do you learn data science?
For the past few months I've been working on a tutorial on the data modeling part of the data science process. Recently I've finished it and as of 2 weeks ago, it is available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
The idea of sampling is fundamental in data science and even though it is taught in every data science book or course out there, there is still a lot to be learned about it. The reason is that sampling is a very deep topic and just like every other data-related topic out there, conventional Statistics fails to do it justice. The reason is simple: good quality samples come about by obtaining an unbiased representation of a population and this is rarely the case from strictly random samples. Also, the fact that Statistics doesn’t offer any metric whatsoever regarding bias in a sample, doesn’t help the whole situation.
The Index of Bias (IB) of a Sample
There are two distinct aspects of a sample, its bias and its diversity. Here we’ll explore the former, as it is expressed in the two fundamental aspects of a distribution, its central point and its spread. For these aspects we’ll use two robust and fairly stable metrics, the median and the inter-quartile range, respectively. The deviation of a sample from the original data in terms of these metrics, with each deviation normalized based on the maximum deviation possible for the given data, yields two metrics, one for the central point and one for the spread. Each metric takes values between 0 and 1, inclusive. The average of these metrics is defined as the index of bias of a sample and takes values in the [0, 1] interval too. Note that the index of bias is always in relation to the original dataset we take the sample from.
Although the above definition applies to one-dimensional data only, it can be generalized to n-dimensional data too. For example, we can define the index of bias of a dataset comprising d dimensions (features) as the arithmetic mean of the index of bias of each one of its features.
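For those who prefer to see this in code, below is a minimal Python sketch of such an index. The definition above doesn't spell out what the maximum possible deviation is, so the normalizations used here are assumptions of this sketch: for the median, the larger of the distances from the data's median to the data's minimum and maximum; for the inter-quartile range, the larger of the data's IQR and the data's range minus its IQR. The function names are also just illustrative.

```python
import statistics


def _median_and_iqr(xs):
    """Median and inter-quartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return statistics.median(xs), q3 - q1


def index_of_bias(sample, data):
    """Index of bias of a 1-D sample relative to the data it was drawn from.

    Assumes every value of `sample` comes from `data`, so that both
    normalized deviations stay within [0, 1].
    """
    d_med, d_iqr = _median_and_iqr(data)
    s_med, s_iqr = _median_and_iqr(sample)
    lo, hi = min(data), max(data)
    # Assumed "maximum deviation possible" for each metric (see lead-in).
    max_center_dev = max(d_med - lo, hi - d_med)
    max_spread_dev = max(d_iqr, (hi - lo) - d_iqr)
    center = abs(s_med - d_med) / max_center_dev if max_center_dev > 0 else 0.0
    spread = abs(s_iqr - d_iqr) / max_spread_dev if max_spread_dev > 0 else 0.0
    return (center + spread) / 2


def index_of_bias_nd(sample_rows, data_rows):
    """Arithmetic mean of the per-feature index of bias (rows are tuples/lists)."""
    d = len(data_rows[0])
    per_feature = [
        index_of_bias([r[j] for r in sample_rows], [r[j] for r in data_rows])
        for j in range(d)
    ]
    return sum(per_feature) / d
```

A sample identical to the original data scores 0 under this sketch, while a sample taken entirely from one end of the data scores close to 1, which is the behavior the definition calls for.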
IB Scores for Various Samples
Strictly random samples tend to have a fairly high IB score, considering that we expect them to be unbiased. That’s not to say that they are always very biased, but they are definitely in need of improvement. Naturally, if the data we are sampling is multi-dimensional, the chances of a bias are higher, resulting in an overall biased sample.
Samples that are engineered with IB in mind are more likely to be unbiased in that sense. Naturally, this takes a bit of effort. Still, given enough random samples, it is possible to get a good enough sample that is unbiased based on this metric. In the attached file I include the IB scores of various samples, for both a random sampling process (first column) and a more meticulous one that aims to obtain a less biased sample (second column). Note that the latter did not use the IB metric in its algorithm, though a variant of it that makes use of that metric is also available (not free though). Also, you don’t need to be an expert in statistical tests to see that the second sampling method is consistently better than the first one. Finally, I did other tests on different data, and in every case, the results were very similar.
Hopefully, this small experiment goes to show that sampling is not the trivial problem it is made out to be by those who follow some old-fashioned paradigm for data analytics. Despite its simplicity, sampling has a lot of facets that need to be explored and understood, before it can be leveraged as a data science technique. After all, what good is an advanced model if the data it is trained on is biased? I believe we owe it to ourselves as data scientists to pay attention to every part of the data science process, including the less interesting parts, such as sampling.
Nowadays, more than ever before, there are a bunch of experts in the data science field, telling everyone what to think and what’s important. This, although useful to some extent, may be a hindrance after you reach a certain level of expertise. That’s not to say that experts’ views are useless, but it’s always good to take them with a pinch of salt.
Experts are people who have learned the field in such depth that they can think of it as people who speak a foreign language can think in terms of that language’s vocabulary and logical structures (e.g. grammar and syntax). An expert in our field doesn't see data science as something outside himself, but rather as a part of him, much like his ability to read and write. This level of intimacy with the know-how in data science enables him to perceive things that most people cannot, and offer deeper insights about the ins and outs of data science.
However, experts don’t know everything and it’s very easy for someone to become so enticed by his expertise that the boundaries of his understanding become blurred. This is a very dangerous thing, since the expert may have the false impression that he knows everything there is to know and/or that everything he knows is valid. However, data science is a very dynamic field, so even if you attain expertise in it, things change so some adaptation is in order. Some experts forget that.
Even if experts have a lot to teach us, we need to always be aware that there are things they do not know, or that they do not know well enough. For example, many experts are very knowledgeable about traditional statistics and whatever lies beyond that part of data science is secondary for them. Yet, even in the field of statistics they only know what they have learned and may lack the curiosity to explore different kinds of Stats, or the humility to acknowledge their existence. Experts like that will tell you that data science is all about statistics, reiterating the stuff they have learned. However, if you try to pinpoint the limitations of what they know, they will label you as a heretic, which is why most people don’t say anything back to them. This is dangerous though, since silence can strengthen their already inflated view of their authority, and bring about even stronger views in them.
That’s why the best approach is to try things out yourself. An expert makes a claim about a certain topic in data science; instead of taking it as fact, put it to the test to see if it holds water. If it’s something that’s public knowledge, cross-reference it. If it’s something that can be verified or disproved through experimentation, write a script around it. Whatever the case, don’t take things for granted, just because some expert says so.
All this is related to developing the right mindset for data science, which is all about asking questions and trying to answer them in a methodical manner (aka the scientific method), using a variety of data analytics methods and lots of programming. Techniques and tools become obsolete sooner or later, but this mindset I’m referring to is always relevant…
First of all, let’s get something straight. I love Statistics and find their role in Data Science a very important one. I’d even go so far as to say that they are essential, even if you specialize in some part of data science that doesn’t need them per se. With this out of the way, I’d like to make the argument that the role of Stats in predictive analytics models in data science is very limited, especially nowadays. Before you move on to another website, bear with me, since even if you don’t agree, being aware of this perspective may be insightful to you.
In general terms, predictive analytics models in data science are the models we build to find the value of a variable using some other variables, usually referred to as features. A predictive analytics model can be anything that provides a mapping between these features and the variable we want to predict (the latter is usually referred to as the target variable). Depending on the nature of the target variable, we can have different methodologies, the most important of which are classification and regression. These are also the most commonly used predictive analytics models out there.
Statistics has been traditionally used in various ways in these predictive analytics models. This kind of statistics is under the umbrella of “inferential statistics” and it used to have some merit when it came to predictions. However, nowadays there are much more robust models out there, some machine learning based, some A.I. based, and some that are combinations of various (non-statistical) models. Many of these models tend to perform quite well, while at the same time, they all refrain from making any assumptions about the data and the distributions it follows. Most inferential statistical models are very limited in that respect, as they expect their variables to follow certain distributions and/or to be independent of each other. Because of all that, nowadays data science professionals tend to rely on non-statistical methods for the predictive analytics models they develop.
That’s not to say that Stats are not useful though. They may still offer value in various ways, such as sampling, exploratory analysis, dimensionality reduction, etc. So, it’s good to have them in your toolbox, even if you’ll probably not rely on them if you plan to develop a more robust predictor in your data science project.
Is It Possible to Have a Set of Numbers in a Variable Where the Majority of Them Is Higher than the Mean?
Common sense would dictate that this is not possible. After all, there are numerous articles out there (particularly on the social media) using that as a sign of a fallacy in an argument. Things like “most people claim that they have better than average communication skills, which is obviously absurd!” are not uncommon. However, a data scientist is generally cautious when it comes to claims that are presented without proof, as she is naturally curious and eager to find out for herself if that’s indeed the case. So, let’s examine this possibility, free from prejudice and the views of the know-it-alls that seem to “know” the answer to this question, without ever using a programming language to at least verify their claim.
The question is clear-cut and well-defined. However, our common sense tells us that the answer is obvious. If we look into it more deeply though and are truly honest with ourselves, we’ll find out that this depends on the distribution of the data. A variable may or may not follow the normal distribution that we are accustomed to. If it doesn’t, it is quite possible for the majority of the data points in a variable to be larger than the average value of that variable. After all, the average value (or arithmetic mean, as it is known to people who have delved into this matter more) is just one measure of central tendency, certainly not the only measure for figuring out the center of a distribution. In the normal distribution, this metric coincides in value with the median, which always lies in the middle of a variable’s values once you sort them in ascending or descending order. However, mean = median (in value) is guaranteed only for a symmetric distribution (like the normal distribution we are so accustomed to assuming for the data at hand). If the distribution is skewed, something quite common, it is possible to have a mean that is smaller than the median, in which case the majority of the data points will be to the right of it, or in layman’s terms, higher in value than the average.
Don’t take my word for it though! Attached is a script in Julia that generates an array that is quite likely to have the majority of its elements higher in value than its overall mean. Feel free to play around with it and find out by yourselves what the answer to this question is. After all, we are paid to answer questions using scientific processes, instead of taking someone else’s answers for granted, no matter who that person is.
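The attached Julia script is not reproduced here, but the same check can be sketched in a few lines of Python. The data below is made up purely for illustration: a tiny hand-written list with one very low value, and a randomly generated left-skewed variable (1 minus the cube of a uniform random number, an arbitrary choice that piles most values up near 1).

```python
import random
import statistics


def fraction_above_mean(xs):
    """Proportion of the values that are strictly greater than the mean."""
    m = statistics.fmean(xs)
    return sum(x > m for x in xs) / len(xs)


# A tiny hand-made example: one low value drags the mean below the rest.
scores = [1, 9, 9, 9, 10]

# A larger, randomly generated left-skewed variable (illustrative choice).
rng = random.Random(0)
skewed = [1 - rng.random() ** 3 for _ in range(10_000)]

for name, xs in (("scores", scores), ("skewed", skewed)):
    print(name,
          "mean:", round(statistics.fmean(xs), 3),
          "median:", round(statistics.median(xs), 3),
          "share above mean:", round(fraction_above_mean(xs), 3))
```

In the hand-made list the mean is 7.6 and four of the five values lie above it, so the majority being above average is entirely possible whenever the mean sits below the median.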
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.