One idea about data science that I find particularly vexing and misleading is that it is merely a part of Statistics, since this paints a very limiting picture of what the field is. Many people have entered data science through Statistics, which is natural given the direct link between the two. However, for reasons of their own, some of them tend to view data science as part of Statistics, or as a branch of it. Let’s delve into this matter and clarify the complicated relationship between data science and Stats before things get out of hand.
First of all, let’s get some definitions in place. Statistics is a sub-field of Mathematics concerned with the description and analysis of data, particularly numeric data, through a variety of models and processes. It is a very useful framework that is essential in data science. Data science, in turn, is usually viewed as a new field, one that comprises several other fields, such as computer science, business, communication, and mathematical modeling. In other words, it’s an inter-disciplinary field that borrows from several others in order to tackle complex problems that couldn’t be solved otherwise.
As I have repeatedly stated in my books as well as in many of my videos, a data scientist needs to know a variety of things, particularly programming. Statisticians usually work in specialized statistical environments, such as R, SAS, or SPSS, for their scripts. These are geared toward statistical analysis and are not the same as the general-purpose programming languages, such as Python, Julia, or Scala, that are usually used in data science. So, calling data science a part of Statistics is not only inaccurate, but also a sign that the speaker doesn’t understand what data science entails.
Also, data science tackles a large variety of data types, including text. In fact, many data scientists focus primarily on text data, and there are various methodologies for quantifying text so that a corpus can be analyzed using a mathematical model. Statistics on its own is not equipped to tackle raw data of this kind, even if data scientists oftentimes make use of Stats when analyzing the quantified text data.
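To make the idea of quantifying text a bit more concrete, here is a minimal sketch of one common approach, a bag-of-words representation, written in plain Python. The function names (`tokenize`, `bag_of_words`, `cosine_similarity`) and the tiny corpus are my own illustrative choices, not part of any particular library, and real projects typically rely on more refined weighting schemes and tooling.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def bag_of_words(docs):
    """Return (vocabulary, vectors), with one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical three-document corpus, used only for illustration.
corpus = [
    "Statistics describes numeric data",
    "Data science also handles text data",
    "Cats sleep all day",
]
vocab, vectors = bag_of_words(corpus)

# Documents that share words end up closer together than unrelated ones.
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Once the text is in this numeric form, statistical methods can be applied to it, which is exactly the division of labor described above.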
Moreover, Statistics tends to make use of models that are based on a number of assumptions about the distribution of the data, or about some of its characteristics. Many data science methods make no such assumptions about the data. This allows for more versatile models that exhibit more robust performance, oftentimes unattainable by statistical models. So, if someone claims that data science is part of Stats, they are probably oblivious to the Machine Learning and A.I. systems employed in data science.
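As a rough illustration of a method that does not assume any particular distribution for the data, here is a minimal k-nearest-neighbors classifier in plain Python. This is just one example I picked for its simplicity; the function name `knn_predict` and the toy data are hypothetical, and the method still relies on the softer assumption that points close to each other tend to share a label.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        ((math.dist(p, query), label) for p, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups, invented for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(points, labels, (1.1, 1.0))
pred_near_b = knn_predict(points, labels, (5.1, 5.0))
```

Note that nothing in this sketch estimates a mean, a variance, or any other distribution parameter; the prediction comes directly from the neighboring data points.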
Naturally, Statistics is useful in data science, and there is no data science course out there that doesn’t cover this framework in its syllabus. Every data scientist is expected to have a solid grasp of Statistics and to use statistical methods in her work. However, relying on Stats exclusively is quite rare and often unproductive.
To sum up, Statistics is a great field that has a lot to offer data science. However, data science is an inter-disciplinary field, borrowing from various areas, including but definitely not limited to Statistics. If you want to learn more about the various aspects of the data science craft and how you can deepen your know-how of it, feel free to check out my latest book, Data Science Mindset, Methodologies, and Misconceptions (Technics Publications). Then, even if you don’t share my view on this topic, at least you’ll be more aware of the complicated relationship between data science and Statistics.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.