Lately, I’ve been thinking a lot about what information is in a data science context, partly because of a couple of projects I’m involved in, and partly because that’s what I enjoy thinking about in my leisure time. After all, there are people who care about data science in a deeper way, as it’s more than a profession for them, something that commands a certain level of dedication that others may not comprehend. As I’m one of those people, I can attest to the unique beauty of the field and the qualities of it that keep it evergreen and ever-interesting.
For the longest time, I was under the impression that it’s the data points or the features that contain the information in a dataset. After all, that’s what most data science sources imply, and it makes some intuitive sense. However, lately I’ve experimented with new algorithms that can generate new data points and new features, while others manage to reduce the number of data points without information loss (aka intelligent sampling) or summarize the same information in a smaller number of features, usually through non-linear combinations of the original features (aka feature fusion). In all these cases, no new information is generated and no significant information is lost. What’s more, if you can effectively replace the original dataset with synthetic data without losing any information, then the claim that information is basically the original data doesn’t hold water. In other words, information exists with or without the data at hand, since the same information can often be expressed more eloquently with a more succinct set of data points or features. Similarly, the essence of an ice cube is not the exact water molecules in it, but the fact that it consists of water and has a certain shape, dictated by the mold of the ice cube tray.
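To make the argument above a bit more tangible, here is a minimal toy sketch in Python. It assumes that "information" can be approximated by the Shannon entropy of the dataset's empirical distribution, which is my own simplification for illustration and not a definition given in this article. The helper `entropy` and the tiny dataset are hypothetical, and the sampling and fusion steps are hand-picked stand-ins rather than the actual algorithms mentioned above.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical toy dataset: eight data points, two binary features each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1), (1, 1), (0, 0), (1, 0)]

# A "new" feature that is a deterministic, non-linear function of the
# original two (XOR). More features, but nothing new is learned.
augmented = [(a, b, a ^ b) for a, b in data]

# A stand-in for feature fusion: an invertible map of two features into one.
fused = [(2 * a + b,) for a, b in data]

# A stand-in for intelligent sampling: half the data points, chosen by hand
# so that the empirical distribution stays the same.
sample = data[:4]

h_original = entropy(data)
h_augmented = entropy(augmented)
h_fused = entropy(fused)
h_sample = entropy(sample)

print(h_original, h_augmented, h_fused, h_sample)
```

Under this simplified measure, all four values come out equal: the extra feature adds nothing, the fused feature loses nothing, and the smaller sample carries the same distribution as the full dataset. Real datasets are rarely this tidy, so treat it as an illustration of the idea rather than a recipe.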
From all this, it follows that what we need is an information-rich dataset, i.e. a dataset that contains useful information without excessive data points or excessive features. Of course, it’s not always easy to perform the transformations required to accomplish this, but it is feasible, and most modern A.I. systems are proof of that. Whether this black box approach is the most effective way to accomplish this information distillation, however, is something that needs to be investigated. In my view, looking into these sorts of matters and having this perspective is far more important than all the technical know-how about the latest and greatest machine learning system, know-how that is oftentimes superficial when not accompanied by the data science mindset. The latter is very important, yet it cannot be described in a simple blog article. I’ve written a book trying to explain it, and even that may not have done it justice.
Anyway, pondering these things may seem a bit philosophical, but if this pondering is transformed into concrete and actionable insights that can help improve existing data science methods or spawn new ones, then it’s probably more than just theoretical. Perhaps it’s this pondering that helps keep data science fresh in our minds, preventing it from becoming a mechanical process devoid of any life and inspiration. After all, just because many people have forgotten what lured them to data science, it doesn’t mean that this is the only way to go. Someone can practice data science and still be enthusiastic about it while maintaining a sense of creative curiosity about the subject. It’s all a matter of perspective...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.