First of all, let’s get something straight. I love Statistics and consider its role in data science a very important one. I’d even go so far as to say that it is essential, even if you specialize in some part of data science that doesn’t need it per se. With this out of the way, I’d like to argue that the role of Stats in the predictive analytics models of data science is very limited, especially nowadays. Before you move on to another website, bear with me: even if you don’t agree, being aware of this perspective may be insightful to you.
In general terms, predictive analytics models in data science are the models we build to predict the value of one variable from some other variables, usually referred to as features. A predictive analytics model can be anything that provides a mapping between these features and the variable we want to predict (usually referred to as the target variable). Depending on the nature of the target variable, we can have different methodologies. The most important and most commonly used of these are classification, where the target is a category, and regression, where the target is a continuous number.
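To make the idea of a mapping from features to a target more concrete, here is a minimal sketch in plain Python. It uses k-nearest neighbors purely as an illustration; the method, the function names, and the toy data are all assumptions made for this example and are not something prescribed above. The same neighbor lookup serves both tasks: a majority vote gives a classification, and an average gives a regression.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def k_nearest(train_X, query, k):
    """Return the indices of the k training rows closest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    return order[:k]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: predict the most common label among the k neighbors."""
    labels = [train_y[i] for i in k_nearest(train_X, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: predict the mean target value among the k neighbors."""
    values = [train_y[i] for i in k_nearest(train_X, query, k)]
    return sum(values) / len(values)


# Toy data, made up for illustration: two features per row.
features = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
class_target = ["low", "low", "low", "high", "high", "high"]  # categorical target
numeric_target = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]         # continuous target

print(knn_classify(features, class_target, (1.1, 1.0)))
print(knn_regress(features, numeric_target, (5.1, 5.0)))
```

Note that nothing in this sketch asks what distribution the features follow; the prediction comes only from distances between data points. That is one simple example of a predictor that does not lean on distributional assumptions, though it is by no means the only one.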
Statistics has traditionally been used in various ways in these predictive analytics models. This kind of statistics falls under the umbrella of “inferential statistics” and it used to have some merit when it came to predictions. However, nowadays there are much more robust models out there, some machine learning based, some A.I. based, and some that are combinations of various (non-statistical) models. Many of these models tend to perform quite well, while at the same time they generally refrain from making assumptions about the data and the distributions it follows. Most inferential statistical models are very limited in that respect, as they expect their variables to follow certain distributions and/or to be independent of each other. Because of all that, data science professionals nowadays tend to rely on non-statistical methods for the predictive analytics models they develop.
That’s not to say that Stats is not useful, though. It may still offer value in various ways, such as sampling, exploratory analysis, dimensionality reduction, etc. So, it’s good to have it in your toolbox, even if you probably won’t rely on it when you plan to develop a more robust predictor in your data science project.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.