First of all, let's get something straight. I love Statistics and consider its role in data science a very important one. I'd even go so far as to say it is essential, even if you specialize in some part of data science that doesn't need it per se. With that out of the way, I'd like to make the argument that the role of Stats in predictive analytics models in data science is very limited, especially nowadays. Before you move on to another website, bear with me: even if you don't agree, being aware of this perspective may prove insightful to you.
In general terms, predictive analytics models in data science are the models we build to find the value of a variable using some other variables, usually referred to as features. A predictive analytics model can be anything that provides a mapping between these features and the variable we want to predict (the latter is usually referred to as the target variable). Depending on the nature of the target variable, we can have different methodologies, the most important of which are classification and regression. These are also the most commonly used predictive analytics models out there.
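To make the idea of a predictive model as a mapping between features and a target concrete, here is a minimal sketch in Python (the numbers and names are purely illustrative, not from any real dataset): a one-feature least-squares regression, a regression rather than a classification because the target is numeric.

```python
# Toy feature/target pairs (illustrative numbers only). The target is
# numeric, so this is a regression problem; a categorical target would
# make it a classification problem instead.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

# Fit y = a*x + b by ordinary least squares. The fitted pair (a, b)
# IS the model: a mapping from the feature to the target variable.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    """The predictive model: maps a feature value to a target estimate."""
    return a * x + b

print(f"model: y = {a:.2f}*x + {b:.2f}; predict(6.0) = {predict(6.0):.2f}")
```

Any more elaborate model, statistical or not, plays the same role: it just provides a (usually richer) mapping of this kind.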
Statistics has traditionally been used in various ways in these predictive analytics models. This kind of statistics falls under the umbrella of inferential statistics, and it used to have some merit when it came to predictions. Nowadays, however, there are much more robust models out there, some machine-learning-based (and therefore A.I.-based) and some that are combinations of various non-statistical models. Many of these models tend to perform quite well while refraining from making strong assumptions about the data and the distributions it follows. Most inferential statistical models are very limited in that respect, as they expect their variables to follow certain distributions and/or to be independent of each other. Because of all that, data science professionals nowadays tend to rely on non-statistical methods for the predictive analytics models they develop.
That’s not to say that Stats are not useful though. They may still offer value in various ways, such as sampling, exploratory analysis, dimensionality reduction, etc. So, it’s good to have them in your toolbox, even if you’ll probably not rely on them if you plan to develop a more robust predictor in your data science project.
Is It Possible to Have a Set of Numbers in a Variable Where the Majority of Them Are Higher than the Mean?
Common sense would dictate that this is not possible. After all, there are numerous articles out there (particularly on social media) using that as a sign of a fallacy in an argument. Claims like “most people say they have better than average communication skills, which is obviously absurd!” are not uncommon. However, a data scientist is generally cautious when it comes to claims presented without proof, as she is naturally curious and eager to find out for herself whether that’s indeed the case. So, let’s examine this possibility, free from prejudice and from the views of the know-it-alls who seem to “know” the answer to this question without ever using a programming language to verify their claim.
The question is clear-cut and well-defined, and our common sense tells us that the answer is obvious. If we look into it more deeply, though, and are truly honest with ourselves, we’ll find that it depends on the distribution of the data. A variable may or may not follow the normal distribution we are accustomed to. If it doesn’t, it is quite possible for the majority of the data points in that variable to be larger than its average value. After all, the average value (or arithmetic mean, as it is more formally known) is just one measure of central tendency, certainly not the only way of figuring out the center of a distribution. In the normal distribution, the mean coincides in value with the median, which always sits at the center of a variable if you order its values in ascending or descending order. However, the equality mean = median holds only for symmetric distributions (like the normal distribution we are so accustomed to assuming characterizes the data at hand). If the distribution is skewed, something quite common, it is possible to have a mean that is smaller than the median, in which case the majority of the data points will lie to the right of the mean, or in layman’s terms, be higher in value than the average.
Don’t take my word for it though! Attached is a script in Julia that generates an array that is quite likely to have the majority of its elements higher in value than its overall mean. Feel free to play around with it and find out for yourselves what the answer to this question is. After all, we are paid to answer questions using scientific processes, instead of taking someone else’s answers for granted, no matter who that person is.
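The Julia script itself isn't reproduced in this excerpt, so here is a comparable sketch in Python (the construction and parameters are my own choices, not the original script): it builds a left-skewed sample by subtracting exponentially distributed noise from a fixed baseline, which drags the mean below the median and lets the majority of the values exceed the mean.

```python
import random

random.seed(42)  # for reproducibility; any seed shows the same effect

# Left-skewed sample: a high baseline minus exponential noise (mean 10).
# A minority of extreme low values pulls the mean below the median.
data = [100 - random.expovariate(1 / 10) for _ in range(10_000)]

mean = sum(data) / len(data)
median = sorted(data)[len(data) // 2]
above = sum(1 for x in data if x > mean)

print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"{above} of {len(data)} values "
      f"({100 * above / len(data):.1f}%) are above the mean")
```

For this distribution, roughly 63% of the values (the fraction of the exponential below its own mean, 1 − 1/e) end up above the sample mean, so the “absurd” claim is perfectly possible.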
We sometimes find ourselves in situations where no matter what we do, and what model we use, there just isn't anything useful coming out of our analysis. In times like these we wonder if an A.I. system would magically solve the problem. However, it may be the case that there just isn't any signal in the data that we are harvesting.
Of course, this whole thing sounds like a cop-out. It’s easy to say that there is no signal there and throw in the towel. However, giving up too quickly is probably worse than not finding a signal, because doing so may eliminate the chance of ever finding something useful in that data. That’s why deciding that there isn’t any signal worth extracting from the data is a tricky thing to do. We should make this decision only after thoroughly examining the data, trying out a variety of feature combinations as well as meta-features, and experimenting with various models. If after doing all this we still end up with mediocre results that are hard to distinguish from chance, then there probably isn’t anything there, and we can move on to another project.
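One concrete way to check whether results are “hard to distinguish from chance” is a simple permutation test. The sketch below uses synthetic data of my own making (binary labels and predictions with no real relationship, mimicking a no-signal dataset) and compares the observed accuracy against accuracies obtained on randomly shuffled labels.

```python
import random

random.seed(0)

# Synthetic binary labels and predictions with no real relationship,
# standing in for a dataset where the model found no signal.
labels = [random.randint(0, 1) for _ in range(200)]
preds = [random.randint(0, 1) for _ in range(200)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

observed = accuracy(labels, preds)

# Null distribution: accuracy after destroying any label/prediction
# pairing by shuffling the labels many times.
null = []
shuffled = labels[:]
for _ in range(1_000):
    random.shuffle(shuffled)
    null.append(accuracy(shuffled, preds))

# Fraction of shuffles doing at least as well as the real predictions;
# a large value means the results are indistinguishable from chance.
p_value = sum(acc >= observed for acc in null) / len(null)
print(f"observed accuracy: {observed:.3f}, permutation p-value: {p_value:.3f}")
```

If a sizable fraction of the shuffles matches or beats the real predictions, the model is doing no better than chance on this data, and the “no signal” conclusion becomes defensible rather than a cop-out.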
However, just because there isn’t a strong enough signal in the data at hand, it doesn’t mean the whole idea is a dead end. Maybe there is potential in it, but we need to pursue it via:
1. more and/or cleaner data like the data we have
2. different kinds of data, to be processed in tandem with the existing data
3. some other application based on that data
The 3rd point is particularly important. Say that we have transaction data, for example, and we want to predict fraud. The data we have is fine, but it is unable to predict anything worthwhile when it comes to fraud. We can still salvage some of the data science work we’ve done though and use it for predicting something else (e.g. some metric for evaluating the efficiency of a transaction, or the general reliability of the network used for these transactions). Just because we cannot predict fraud very well, it doesn’t make the data useless in general.
So, if the data doesn’t turn into any viable insights or data products, that’s fine. Not all science experiments reach successful conclusions. We only hear about the success stories in the scientific literature, but for every successful experiment behind these stories there are several others that were unsuccessful. As long as we are not daunted by the results and keep working with the data, there is always success on the horizon, even if it comes about in a somewhat different project based on that data. That’s worth keeping in mind, since it’s really our mindset that’s our best asset, even more than our data and our tools.
So, when I was in the US recently, I was interviewed by some people from a podcast geared towards software engineering and data science topics (with some A.I. material too). The interview, which constitutes a whole episode of that podcast, covered various topics related both to data science as a field and to specific aspects of it that can help someone embrace it as a practitioner or professional. The episode is now online and freely available. Although it’s by no means a thorough coverage of the field of data science, or even of the data science mindset, it’s a good introduction, engaging enough to make your commute somewhat more interesting than listening to the radio. Enjoy!
How the Use of A.I. in Road-Based Logistics and Transportation Can Be Smooth and Congruent with the Current Status Quo
People talk a lot these days about how self-driving cars will solve all of our logistics and transportation related problems when they finally hit the roads. The thing is that the problems they are trying to solve are not as simple, nor is their adoption going to be as easy as these idealistic people think. Although there is nothing wrong with dreaming of a better future, free of traffic and avoidable accidents, it’s also important to look at this matter from a more realistic point of view.
First of all, the self-driving car needs to be re-examined. The idea of a completely autonomous car is a long way from materializing, even if there are A.I. systems out there that can navigate a car effectively over large distances. Expecting these A.I. drivers to become the norm in the foreseeable future is quite unrealistic, and the reason is simple economics. These systems are going to be very expensive, so they will naturally appeal only to a small part of the population. Also, as they gradually become more affordable, they will push down the price of conventional vehicles, making the latter more appealing. This is dynamic systems 101, something that many of these self-driving-car visionaries are apparently not that familiar with, just as they don’t understand people that well. If Joe and Jane find that a new self-driving car costs 50% more than the car they’ve been dreaming of for the past 5 years (because that make has been around forever and that model has been heavily advertised for as long as they can remember), they will probably go with the conventional car, even if the self-driving one is objectively a better choice in general.
However, if A.I. systems in cars were to adopt an auxiliary role, much like Elon Musk envisions for his Tesla vehicles, then they have a chance. After all, not many people are willing to give up control of their cars just yet. This is evident when you talk with competent drivers who have driven outside the US. Such people take a strong interest in stick-shift cars, since these give them more control over the vehicle, making them feel better about their role as drivers. Also, stick-shift cars are more economical, require less transmission maintenance (e.g. no transmission fluids), and are generally as reliable as their automatic counterparts. Unless of course you never learn how to use the clutch, which is another matter!
If self-driving cars are self-driving only at certain times, when the driver chooses (e.g. on a long road trip, or a mundane commute over I-90), then they can definitely add value. However, if they are entirely self-sufficient, with no potential input from the human in the driver’s seat, then they are less likely to gain people’s trust, apart from those already convinced of their inherent value. Whatever the case, it is interesting to see how this new trend will evolve and what kind of data it will bring about for data science professionals to analyze!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.