Over the years I have been a bit harsh on Statistics, especially since I was first exposed to the propaganda that Stats is the way to go in Data Science. The idea that statistical analyses would help us tackle big data problems didn't (and still doesn't) make any sense to me, especially after running experiments on various data analytics methods, statistical and otherwise, for several years. Although statistical methods have merit, in my experience they have proven less effective and less efficient than machine learning methods, particularly A.I. methods like ANNs. Yet statistics is still useful in some ways.
Data analytics involves more than just building models. Before we reach the stage where we have a dataset ready to use for predicting or analyzing something, whether to build a data product or to derive some useful insights, we need to build that dataset. To do that, we often need to get our hands dirty by running a lot of experiments on the data itself, using a variety of methods. Some of these methods derive from statistics. For example, we may need to explore the relationship between two variables (or across all the pairs of variables available). This is made possible with methods such as correlation and covariance. Even if these tools are suboptimal, they are a good starting point, and in many cases they may even suffice. Also, PCA and SVD, which fall under the statistics umbrella, remain very popular dimensionality reduction methods.
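To make the pairwise exploration above concrete, here is a minimal sketch that computes the sample covariance and Pearson correlation for every pair of variables in a small dataset. The column names and values are made up for illustration, and the helper functions are my own rather than part of any particular library.

```python
from itertools import combinations
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


def pairwise_correlations(columns):
    """Correlation for every pair of named columns in a dict of lists."""
    return {
        (a, b): correlation(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical toy dataset, invented for this example.
data = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [5.0, 3.0, 4.0, 1.0, 2.0],
}

pairs = pairwise_correlations(data)
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pearson correlation only captures linear relationships, which is one reason it is best treated as a first look rather than a final verdict.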
Another example where statistics comes in handy is when you need to check the validity of a hypothesis. Although there are some simulation-based methods that can do that, statistics has a variety of tests that cover many combinations of variable types and distributions, enabling us to test our hypotheses in a methodical and rigorous manner. Of course, we may still need to do some analysis beyond that to establish the stability of our results, but there is no doubt that statistical tests can be useful as a first step.
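As a rough illustration of both routes mentioned above, the sketch below computes a classic two-sample statistic (Welch's t) and then estimates a p-value for it with a permutation test, which is one of the simulation-based options. The two groups are invented numbers, and the function names and the number of rounds are assumptions of mine, not a prescribed recipe.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(a, b, n_rounds=2000, seed=0):
    """Two-sided p-value estimated by shuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        # Re-split the shuffled values into two groups of the original sizes.
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_rounds + 1)


# Hypothetical measurements for two groups, invented for this example.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2]
group_b = [4.2, 4.8, 4.4, 4.9, 4.1, 4.6, 4.3, 4.7]

t_stat = welch_t(group_a, group_b)
p_value = permutation_p_value(group_a, group_b)
print(f"t = {t_stat:.2f}, permutation p-value = {p_value:.4f}")
```

In practice a library such as SciPy would give you the analytical p-value directly; the permutation version is shown here because it needs no distributional assumptions and fits in a few lines.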
Finally, when it comes to sampling, statistics is usually our go-to framework. This set of simple techniques for obtaining a subset of a larger dataset may seem, well, simplistic, but it’s essential. After all, even the most sophisticated machine learning models are bound to fail (by over-fitting, for instance) if sampling isn't done right. There is a reason why statistics became a popular data analytics framework, and it’s quite likely that sampling played an important role in this (though I’ll need to run some tests to establish an exact measure of the likelihood!).
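One common way of doing sampling "right" is stratified sampling, where each class keeps roughly its original share in the subset. The following is a minimal sketch of that idea; the function name, the toy records, and the rule of keeping at least one item per class are assumptions I made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of the records from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for members in groups.values():
        # Keep at least one record per group so rare classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical imbalanced dataset: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

sample = stratified_sample(records, key=lambda r: r[0], fraction=0.2)
counts = {label: sum(1 for r in sample if r[0] == label) for label in ("a", "b")}
print(counts)
```

A plain random sample of 20 records could easily end up with no "b" records at all, which is exactly the kind of quiet problem that later shows up as a model that doesn't generalize.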
So, even if A.I. and machine learning are the foxy way to go when it comes to data science, statistics has a place in the data scientist’s toolbox too. Plus, with so many people in data science focusing on the new and better tools that are in vogue these days, maybe a differentiator of a competent data scientist in the future will be how well she can handle statistical concepts and carry out basic tasks in a methodical manner. Besides, if there is one thing that statistics can teach us, it’s being methodical and scientific in how we conduct our analyses, qualities that are timeless in the data science field and foxy in their own way.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.