The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it it makes good sense. I've been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationship of the variables, since it influences the densities of the data. However, a higher-order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.