Nowadays, there is a lot of confusion between the fields of Statistics and Machine Learning, which often hinders the development of the data science mindset. This is especially the case when it comes to innovation since this confusion can often lead someone to severe misconceptions about what's possible and the value of the data-driven paradigm. This article will look into this topic and explore how you can add value through a clearer understanding of it.
First of all, let's look at what statistics is. As you may recall, statistics is the sub-field that deals with data from a probabilistic standpoint. It involves building and using the mathematical models that describe how the data is distributed (distributions) and figuring out the likelihood of data points being part of these distributions through these functions. Many heuristics are used, though they are called statistic metrics or just statistics, and the whole field is generally very formalist. When data behaves as expected and when there is an abundance of it, statistics works fine, while it also yields predictive models that are transparent and easy to interpret or explain.
Now, what about machine learning? Well, this is a field of data analytics that involves an alternative approach to handling data, one that's more data-driven. In other words, machine learning involves analyzing the data without any ad-hoc notions about it, in the form of distributions or any other mathematical frameworks. That's not to say that machine learning doesn't involve math, far from it! However, the math in this field is used to analyze the data through heuristic-based and rule-based models, rather than probabilistic ones. Also, machine learning ties in very well with artificial intelligence (A.I.), so much so that they are often conflated. A.I. has a vital role in machine learning today, so the AI-based models often dominate in this field.
As you may have figured out by now, the critical difference between statistics and machine learning lies in the fact that statistics makes assumptions about the data (through the distribution functions, for example) while machine learning doesn't. This difference is an important one that makes all the difference (no pun intended) in the models' performance. After all, when data is complex and noisy, machine learning models tend to have better performance and generalize better. Perhaps that's why they are the most popular option in data science projects these days.
You can learn more about the topic by checking out my latest book, which covers machine learning from a practical and grounded perspective. Although the book focuses on how you can leverage Julia for machine learning work, it also covers some theoretical concepts (including this topic, in more depth). This way, it can help you cultivate the right mindset for data science and broaden your perspective on the data science field. So, check it out when you have a moment!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.