Ever since social media (SM) became a mainstream way of spending one’s time on the web, they have started to disrupt the way we view information, and even knowledge to some extent. Even though there is no doubt that SM offer substantial benefits in advertising and branding, there is little they can offer when it comes to actually learning something. Here is why.
Some articles can be thought-provoking, but consuming information to satisfy your curiosity and actually assimilating it are two different things. This is particularly true in a technical field like data science, where being informed about something is barely enough to have an opinion on the topic, let alone do something useful with it. Many people who roam the SM in search of mentors don’t realize that. They tend to forget that following someone in an attempt to learn from them is the equivalent of body-building by just hanging out in the lobby of a gym. Yet, they do it anyway because it’s easy and it doesn’t cost them anything (other than some time, assuming that they read the stuff their leaders post on the SM).
If you really want to learn something, especially something complex and multifaceted like data science, you need to get your hands dirty and you have to break a sweat. The various things someone posts on the SM aren’t going to help much. There is a reason why books and videos on the subject sell, even though there is abundant information on the web. Also, in my experience, if a platform doesn’t charge you for the “products” it offers you, that’s because you are the product! SM are designed with that in mind. Of course, some of them may be worth the time you spend on them, since they can be a source of a diverse array of views on a topic (hopefully from different perspectives), but that’s not the same as applicable knowledge. If you want to hone your data science skills you need something you can rely on, not something someone types on the SM while enjoying their morning coffee to pass the time.
So, what can you do instead of following someone on the SM? There are various strategies, each with its own set of benefits. Ideally, you would do a combination of them to maximize your learning opportunities. The main ones of these strategies are:
What are your thoughts on the matter? How do you learn data science?
For the past few months I've been working on a tutorial on the data modeling part of the data science process. I recently finished it, and as of two weeks ago it has been available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
The idea of sampling is fundamental in data science, and even though it is taught in every data science book and course out there, there is still a lot to be learned about it. Sampling is a very deep topic, and just like every other data-related topic, conventional Statistics fails to do it justice. The reason is simple: good-quality samples come about by obtaining an unbiased representation of a population, which is rarely what a strictly random sample delivers. Moreover, the fact that Statistics doesn’t offer any metric whatsoever regarding the bias in a sample doesn’t help the situation.
The Index of Bias (IB) of a Sample
There are two distinct aspects of a sample: its bias and its diversity. Here we’ll explore the former, as it is expressed in the two fundamental aspects of a distribution, its central point and its spread. For these aspects we’ll use two robust and fairly stable metrics: the median and the interquartile range, respectively. The deviation of a sample in terms of these metrics, with each deviation normalized by the maximum deviation possible for the given data, yields two metrics, one for the central point and one for the spread. Each metric takes values between 0 and 1, inclusive. The average of these two metrics is defined as the index of bias of a sample, and it takes values in the [0, 1] interval too. Note that the index of bias is always in relation to the original dataset we take the sample from.
Although the above definition applies to one-dimensional data only, it can be generalized to n-dimensional data too. For example, we can define the index of bias of a dataset comprising d dimensions (features) as the arithmetic mean of the indexes of bias of each one of its features.
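As an illustration, here is a minimal sketch of this metric in Python. Note that the exact normalizations are an assumption on my part (the distance from the population median to the farthest data point for the central-point component, and the population’s full range for the spread component), since the definition above doesn’t pin them down:

```python
import statistics as stats

def iqr(x):
    """A simple interquartile-range estimate: medians of the lower and upper halves."""
    xs = sorted(x)
    n = len(xs)
    q1 = stats.median(xs[: n // 2])
    q3 = stats.median(xs[(n + 1) // 2:])
    return q3 - q1

def index_of_bias(sample, population):
    """Average of the normalized median deviation and normalized IQR deviation."""
    med_pop, med_s = stats.median(population), stats.median(sample)
    # assumed normalization: the largest deviation any sample median could attain,
    # i.e. the distance from the population median to the farthest data point
    max_med_dev = max(abs(med_pop - min(population)), abs(med_pop - max(population)))
    d_center = abs(med_s - med_pop) / max_med_dev if max_med_dev else 0.0
    # assumed normalization: the IQR deviation is bounded by the population's full range
    max_iqr_dev = max(population) - min(population)
    d_spread = abs(iqr(sample) - iqr(population)) / max_iqr_dev if max_iqr_dev else 0.0
    return (d_center + d_spread) / 2
```

A sample identical to the population scores 0, while a sample drawn only from one tail scores noticeably higher, which matches the intent of the metric.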
IB Scores for Various Samples
Strictly random samples tend to have a fairly high IB score, which is surprising, considering that we expect them to be unbiased. That’s not to say that they are always very biased, but they definitely leave room for improvement. Naturally, if the data we are sampling is multi-dimensional, the chances of bias are higher, resulting in an overall biased sample.
Samples that are engineered with IB in mind are more likely to be unbiased in that sense. Naturally, this takes a bit of effort. Still, given enough random samples, it is possible to find one that is practically unbiased according to this metric. In the attached file I include the IB scores of various samples, for both a random sampling process (first column) and a more meticulous one that aims to obtain a less biased sample (second column). Note that the latter did not use the IB metric in its algorithm, though a variant of it that makes use of that metric is also available (not for free, though). Also, you don’t need to be an expert in statistical tests to see that the second sampling method is consistently better than the first one. Finally, I ran the same tests on different data, and in every case the results were very similar.
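To give a sense of the “given enough random samples” approach, here is a rough sketch in Python. For brevity it uses only the central-point component of the bias as a stand-in for the full IB metric (an assumption on my part), and simply keeps the least biased of several random draws:

```python
import random
import statistics as stats

def median_bias(sample, population):
    # central-point component only: the sample median's deviation from the
    # population median, normalized by the population's range
    rng = max(population) - min(population)
    return abs(stats.median(sample) - stats.median(population)) / rng

random.seed(42)
# a right-skewed population, where naive random samples are easily biased
population = [random.expovariate(1.0) for _ in range(10_000)]

# best-of-k: draw several random samples and keep the least biased one
candidates = [random.sample(population, 100) for _ in range(50)]
best = min(candidates, key=lambda s: median_bias(s, population))
```

The winning sample’s bias is typically an order of magnitude lower than that of a single random draw, at the cost of drawing (and scoring) k samples instead of one.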
Hopefully, this small experiment goes to show that sampling is not the trivial problem it is made out to be by those who follow some old-fashioned paradigm of data analytics. Despite its apparent simplicity, sampling has a lot of facets that need to be explored and understood before it can be leveraged as a data science technique. After all, what good is an advanced model if the data it is trained on is biased? I believe we owe it to ourselves as data scientists to pay attention to every part of the data science process, including the less interesting parts, such as sampling.
Nowadays, more than ever before, there are a bunch of experts in the data science field, telling everyone what to think and what’s important. This, although useful to some extent, may be a hindrance after you reach a certain level of expertise. That’s not to say that experts’ views are useless, but it’s always good to take them with a pinch of salt.
Experts are people who have learned the field in such depth that they can think in it, much like people who speak a foreign language fluently can think in terms of that language’s vocabulary and logical structures (e.g. grammar and syntax). An expert in our field doesn't see data science as something outside himself, but rather as a part of him, much like his ability to read and write. This level of intimacy with the know-how of data science enables him to perceive things that most people cannot, and to offer deeper insights about the ins and outs of data science.
However, experts don’t know everything, and it’s very easy for someone to become so enticed by his expertise that the boundaries of his understanding become blurred. This is very dangerous, since the expert may be under the false impression that he knows everything there is to know and/or that everything he knows is valid. Yet data science is a very dynamic field, so even if you attain expertise in it, things keep changing and some adaptation is in order. Some experts forget that.
Even though experts have a lot to teach us, we need to always be aware that there are things they do not know, or do not know well enough. For example, many experts are very knowledgeable about traditional statistics, and whatever lies beyond that part of data science is secondary to them. Yet, even within statistics they only know what they have learned, and they may lack the curiosity to explore different kinds of Stats, or the humility to acknowledge their existence. Experts like that will tell you that data science is all about statistics, reiterating the stuff they have learned. However, if you try to pinpoint the limitations of what they know, they will label you a heretic, which is why most people don’t talk back to them. This is dangerous though, since silence can strengthen their already inflated sense of authority and entrench their views even further.
That’s why the best approach is to try things out yourself. When an expert makes a claim about a certain topic in data science, instead of taking it as fact, put it to the test to see if it holds water. If it’s something that’s public knowledge, cross-reference it. If it’s something that can be verified or disproved through experimentation, write a script around it. Whatever the case, don’t take things for granted just because some expert says so.
All this is related to developing the right mindset for data science, which is all about asking questions and trying to answer them in a methodical manner (aka the scientific method), using a variety of data analytics methods and lots of programming. Techniques and tools become obsolete sooner or later, but this mindset I’m referring to is always relevant…
First of all, let’s get something straight. I love Statistics and consider their role in Data Science a very important one. I’d even go so far as to say that they are essential, even if you specialize in some part of data science that doesn’t need them per se. With this out of the way, I’d like to make the argument that the role of Stats in predictive analytics models in data science is very limited, especially nowadays. Before you move on to another website, bear with me: even if you don’t agree, being aware of this perspective may prove insightful to you.
In general terms, predictive analytics models in data science are the models we build to find the value of a variable using some other variables, usually referred to as features. A predictive analytics model can be anything that provides a mapping between these features and the variable we want to predict (the latter is usually referred to as the target variable). Depending on the nature of the target variable, we can have different methodologies, the most important of which are classification and regression. These are also the most commonly used predictive analytics models out there.
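To make the feature-to-target mapping concrete, here is a toy classifier in Python. Both the data points and the nearest-neighbor rule are just assumptions for the sake of the example; any model that provides such a mapping would fit the definition above:

```python
# hypothetical training data: two numeric features per point and a binary target
train = [((1.0, 2.0), 0), ((1.5, 1.8), 0), ((5.0, 8.0), 1), ((6.0, 9.0), 1)]

def predict(features):
    """A minimal predictive analytics model: map a feature vector to the
    target of its closest training point (1-nearest-neighbor)."""
    def dist2(a, b):
        # squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda pair: dist2(pair[0], features))[1]
```

Had the target been a continuous number instead of a class label, the same mapping idea would make this a regression model rather than a classification one.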
Statistics has traditionally been used in various ways in these predictive analytics models. This kind of statistics falls under the umbrella of “inferential statistics” and it used to have some merit when it came to predictions. However, nowadays there are much more robust models out there, some machine-learning-based, some A.I.-based, and some that are combinations of various (non-statistical) models. Many of these models tend to perform quite well, while at the same time refraining from making any assumptions about the data and the distributions it follows. Most inferential statistical models are very limited in that respect, as they expect their variables to follow certain distributions and/or to be independent of each other. Because of all that, data science professionals nowadays tend to rely on non-statistical methods for the predictive analytics models they develop.
That’s not to say that Stats are not useful though. They may still offer value in various ways, such as sampling, exploratory analysis, dimensionality reduction, etc. So, it’s good to have them in your toolbox, even if you’ll probably not rely on them if you plan to develop a more robust predictor in your data science project.
Is It Possible to Have a Set of Numbers in a Variable Where the Majority of Them Is Higher than the Mean?
Common sense would dictate that this is not possible. After all, there are numerous articles out there (particularly on the social media) using that as a sign of a fallacy in an argument. Things like “most people claim that they have better than average communication skills, which is obviously absurd!” are not uncommon. However, a data scientist is generally cautious when it comes to claims that are presented without proof, as she is naturally curious and eager to find out for herself if that’s indeed the case. So, let’s examine this possibility, free from prejudice and the views of the know-it-alls that seem to “know” the answer to this question, without ever using a programming language to at least verify their claim.
The question is clear-cut and well-defined, and our common sense tells us that the answer is obvious. If we look into it more deeply, though, and are truly honest with ourselves, we’ll find that the answer depends on the distribution of the data. A variable may or may not follow the normal distribution we are accustomed to. If it doesn’t, it is quite possible for the majority of the data points in a variable to be larger than the average value of that variable. After all, the average value (or arithmetic mean, as it is more formally known) is just one measure of central tendency, certainly not the only way of figuring out the center of a distribution. In the normal distribution, this metric coincides in value with the median, which is always at the center of a variable if you order its values in ascending or descending order. However, the claim that mean = median (in value) holds true only for symmetric distributions (like the normal distribution we are so accustomed to assuming characterizes the data at hand). If the distribution is skewed, something quite common, it is possible to have a mean that is smaller than the median, in which case the majority of the data points will lie to the right of the mean, or in layman’s terms, be higher in value than the average.
Don’t take my word for it, though! Attached is a script in Julia that generates an array that is quite likely to have the majority of its elements higher in value than its overall mean. Feel free to play around with it and find out for yourselves what the answer to this question is. After all, we are paid to answer questions using scientific processes, instead of taking someone else’s answers for granted, no matter who that person is.
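For those who prefer Python, here is a quick sketch of the same idea (independent of the attached Julia script). It uses a left-skewed sample, namely a negated exponential one, so the mean ends up below the median:

```python
import random

random.seed(0)
# negating an exponential sample gives a left-skewed distribution:
# the few very negative values drag the mean below the median
data = [-random.expovariate(1.0) for _ in range(100_000)]

mean = sum(data) / len(data)
above = sum(1 for x in data if x > mean)
fraction_above = above / len(data)  # the majority of the points exceed the mean
```

For this particular distribution, roughly 63% of the points lie above the mean, confirming that the “majority above average” scenario is entirely possible.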
We sometimes find ourselves in situations where no matter what we do, and what model we use, there just isn't anything useful coming out of our analysis. In times like these we wonder if an A.I. system would magically solve the problem. However, it may be the case that there just isn't any signal in the data that we are harvesting.
Of course, this whole thing sounds like a cop-out. It’s easy to say that there is no signal there and throw in the towel. However, giving up too quickly is probably worse than not finding a signal, because doing so may eliminate the chance of ever finding something useful in that data. That’s why deciding that there isn’t any signal worth extracting from the data is a tricky thing to do. We must make this decision only after thoroughly examining the data, trying out a variety of feature combinations as well as meta-features, and also experimenting with various models. If after doing all this we still end up with mediocre results that are hard to distinguish from chance, then there probably isn’t anything there, and we can proceed to another project.
However, just because there isn’t a strong enough signal in the data at hand, it doesn’t mean the whole idea is worthless. Maybe there is potential in it, but we need to pursue it via:
1. more and/or cleaner data like the data we have
2. different kinds of data, to be processed in tandem with the existing data
3. some other application based on that data
The 3rd point is particularly important. Say that we have transaction data, for example, and we want to predict fraud. The data we have is fine, but it is unable to predict anything worthwhile when it comes to fraud. We can still salvage some of the data science work we’ve done though and use it for predicting something else (e.g. some metric for evaluating the efficiency of a transaction, or the general reliability of the network used for these transactions). Just because we cannot predict fraud very well, it doesn’t make the data useless in general.
So, if the data doesn't turn into any viable insights or data products, that’s fine. Not all science experiments end in successful conclusions. We only hear about the success stories in the scientific literature, but for every successful experiment behind these stories there are several other ones that were unsuccessful. As long as we are not daunted by the results and continue working the data, there is always success on the horizon. This success may come about in a somewhat different project though, based on that data. That’s something worth keeping in mind, since it’s really the mindset we have that’s our best asset, even better than our data and our tools.
So, when I was in the US recently, I was interviewed by some people from a podcast geared towards SW engineering and data science topics (with some A.I. stuff too). This interview, which constitutes a whole episode of that podcast, covered various topics related to both data science as a field and some specific aspects of it that can help someone embrace it as a practitioner / professional. The podcast episode is now online and freely available. Although it's by no means a thorough coverage of the field of data science, or even of the topic of the mindset related to it, it's a good introduction, engaging enough to make your commute somewhat more interesting than listening to the radio. Enjoy!
How the Use of A.I. in Road-based Logistics and Transportation Can Be Smooth and Congruent with the Status Quo
People talk a lot these days about how self-driving cars will solve all of our logistics and transportation related problems when they finally hit the roads. The thing is that the problems they are trying to solve are not as simple as they seem, nor is their adoption going to be as easy as these idealistic people think. Although there is nothing wrong with dreaming of a better future, free of traffic and avoidable accidents, it’s also important to look at this matter from a more realistic point of view.
First of all, the self-driving car itself needs to be re-examined. The idea of a completely autonomous car is a long way from materializing, even if there are A.I. systems out there that can navigate a car effectively over large distances. Expecting these A.I. drivers to become the norm in the foreseeable future is quite unrealistic, though. The reason is simple economics. These systems are going to be very expensive, so they will naturally appeal only to a small part of the population. Also, as they gradually become more affordable, they will push down the price of conventional vehicles, making the latter more appealing. This is dynamic systems 101, something that apparently many of these visionaries of the self-driving car are not that familiar with, just like they don’t understand people all that well. If Joe and Jane find that this new self-driving car costs 50% more than the car they’ve been dreaming of for the past 5 years, because that particular make has been around forever and that model has been heavily advertised for as long as they can remember, they will probably go with the conventional car, even if the self-driving car is objectively a better choice in general.
However, if A.I. systems in cars were to adopt an auxiliary role, much like what Elon Musk envisions for his Tesla vehicles, then they have a chance. After all, not many people are willing to give up control of their cars just yet. This is evident when you talk with competent drivers who have driven outside the US. These people take a strong interest in stick-shift cars, since these give them more control over the vehicle, making them feel better about their role as drivers. Also, stick-shift cars are more economical, require less maintenance in terms of the transmission (e.g. no transmission fluids), and are generally quite reliable (at least as much as their automatic counterparts). Unless of course you never learn how to use the clutch, which is another matter!
If self-driving cars are self-driving only at certain times, when the driver chooses (e.g. in the case of a long road trip, or a mundane commute over I-90), then they can definitely add value. However, if they are entirely self-sufficient, with no potential input from the human in the driver’s seat, then they are less likely to gain people’s trust, apart from those already convinced of their inherent value. Whatever the case, it will be interesting to see how this new trend evolves and what kind of data it brings about for data science professionals to analyze!
People talk a lot these days about what it takes to be a good data scientist, and how, if you do their boot camp or join their course, you will acquire it and make yourself stand out from the data scientist pool. Some of these people may be on to something, but they generally focus a lot on specific skills and general abilities. That’s fine if you have the time to study what they are saying and find out for yourself what you need. However, if you just want the single idea that lies at the root of all the stuff they talk about, that’s something few can share with you, because they probably don’t know it.
There are data scientists who know, however, what it takes to be a good data scientist, and many of them have already embodied this in their careers. Yet, they are so busy applying it that they don’t go out of their way to let you know, unless of course they are into education, in which case they will probably mention it in their books or videos.
One feature that I’ve found succinctly summarizes what it takes to be a good data scientist, regardless of your domain or your specialization, is persistent engagement with the craft. Let’s break this down a bit, since it’s a fairly complex feature (a meta-feature if you will). It comprises two things working in tandem: persistence and engagement. The first has to do with a sense of rhythm and commitment. All decent data scientists are very focused on what they are doing, even if they are involved in other things (e.g. 90-95% of my work is around data science, though I’m also involved in Cyber Security and, to a smaller extent, in Neuroscience). Also, we tend to practice data science in one way or another very regularly. In other words, it is part of our daily routine. These are all manifestations of persistence.
As for engagement, that is more of an inner state, an aspect of the mindset of a good data scientist. It involves being fascinated by the craft, even if it may seem that it doesn’t have any secrets from you any more. The thing is that there are always new things to learn, especially over time as it evolves and new methods and techniques come about. Engagement is akin to what is known in Zen as the “beginner’s mind” which is a certain approach to things as if they are completely new to you. Coupled with the experience and expertise that a good data scientist has, this approach allows him to go more in depth regarding the field and find new ways to bring about value through data science. It also involves coming up with new models, new processes for data engineering, and in some cases, new data products.
Persistent engagement with data science doesn’t require particular talent or experience, however. Everyone can (and ought to) embrace it. So, instead of trying to memorize the inner workings of some obscure model, just because someone else says so, try cultivating this trait first. Afterwards, everything else will appear easier and more interesting, just like new know-how appears intriguing and within reach to a novice with a genuine thirst for learning. After all, there are many ways to achieve mastery of the craft, but they all go through persistent engagement.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.