Just wanted to wish you all Happy Holidays! It's been a great year and I appreciate your support through this blog. I won't be posting anything new in the next couple of weeks as I'll be traveling. Feel free to check out some of my older posts, though.
I hope your holidays are insightful, inspirational, and intriguing!
More important than remembering facts and methods related to data science problems is the trinity of inspiration, intuition, and imagination, with intelligence binding them all together. Without inspiration, however, what we know about data science is unlikely to grow much, as our knowledge and know-how gradually crystallize and start giving in to entropy. So, I'd like to take a moment and remind everyone (including myself) of the value of inspiration, even in a fairly technical field such as data science (I don't mention A.I. here because A.I. is its own source of inspiration, especially when one considers its applications).
So, what's your data science inspiration like? Where does it come from? What does it incentivize you towards? These are questions we need to ask ourselves from time to time, in order to make our learning of the field a sustainable process. The input of other data scientists is important in helping with that, but they may not always inspire us, especially after we grow out of the initial stages of our learning. This beginner’s mind, although powerful, is also fleeting, and once it gives way to a more pragmatic view of data science, it is easy to lose our original enthusiasm for the field. That’s where inspiration comes in.
For me, the source of inspiration in data science is two-fold. The first is my own research in the field, unbound by an academic agenda or a particular ideology (e.g. futurism). Such research is still disciplined but at the same time somewhat free, as in freedom (you can’t have research void of cost, unfortunately, even if that cost is just the time you dedicate to it). The other source of inspiration is mentoring, particularly students who are committed to learning data science in a structured and disciplined manner, such as through the Thinkful courses on the subject. Naturally, I’d be happy to mentor other data science aspirants but so far this hasn’t taken place, for various reasons.
Beyond these, the educational material I create as well as the conferences I participate in can be a great source of inspiration too. However, these do not happen frequently enough to be considered primary sources of inspiration, no matter how impactful they can be at times. In practice, they often act as conduits of inspiration, to a certain extent, something that’s also valuable. After all, all these aspects of my data science presence are interconnected and feed off each other.
What about you? What’s your inspiration for data science like? Does it come from a particular application, methodology, or educational material? How do you ensure that inspiration is part of your data science life?
The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye, and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it, it makes good sense. I thought about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often carry pieces of information that correspond to characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
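To make the idea of metrics describing a variable a bit more concrete, here is a minimal sketch using only Python's standard library. The `describe` helper and the `ages` values are made up for illustration; they are not taken from any particular dataset or library.

```python
import statistics

def describe(values):
    """Summarize one numeric variable with a few common descriptive metrics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Hypothetical variable: the ages of seven people (made-up values)
ages = [23, 35, 31, 42, 28, 35, 51]
summary = describe(ages)

for name, value in summary.items():
    print(f"{name}: {value}")
```

A handful of numbers like these says more about the variable as a whole than any individual data value does, which is the point made above.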
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
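The point about data generation can be illustrated with a small experiment. In the sketch below (synthetic data, a hand-written `pearson` helper, and an arbitrary random seed, all of which are my own assumptions), shuffling one variable keeps its distribution exactly the same but wipes out its relationship with the other variable.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two related synthetic variables: y depends linearly on x, plus some noise
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]

# "Generate" a new y that has exactly the same distribution as the original
# but ignores x: simply shuffle the original values
y_shuffled = y[:]
rng.shuffle(y_shuffled)

r_original = pearson(x, y)
r_shuffled = pearson(x, y_shuffled)

print(f"correlation in the original data: {r_original:.3f}")
print(f"correlation after shuffling y:    {r_shuffled:.3f}")
```

The shuffled data would pass any check that looks at one variable at a time, yet it is a poor imitation of the original, since the relationship between the two variables is gone.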
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationships among the variables, since it influences the densities of the data. However, a higher order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
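As a rough illustration of feature evaluation against a target variable, the sketch below scores each feature by how far apart the two class means are, measured in units of the feature's overall standard deviation. This is just one crude heuristic among many, and the `class_separation` function, the feature names, and all the values are made up for the example.

```python
import statistics

def class_separation(feature, target):
    """Gap between the two class means, in units of the overall standard deviation."""
    group0 = [v for v, t in zip(feature, target) if t == 0]
    group1 = [v for v, t in zip(feature, target) if t == 1]
    gap = abs(statistics.mean(group1) - statistics.mean(group0))
    return gap / statistics.pstdev(feature)

# Hypothetical binary target and two made-up features
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "uninformative": [2.0, 3.0, 1.0, 2.5, 2.0, 3.0, 1.0, 2.5],
}

scores = {name: class_separation(vals, target) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

The first feature takes clearly different values in the two classes, so the target variable gives the data a visible structure along it; the second feature looks the same in both classes and scores zero.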
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.