“Wait a minute! Isn’t data science all about cool machine learning models, number-crunching, artificial intelligence methods, and big data?” I can hear some people saying. Well, it is all that, but the one thing that binds all these different aspects of data science together is domain knowledge, or in other words, context. You may be adept in cleaning, structuring, and modeling the data at hand but if you are missing the bigger picture and how all this data (and its distillation) relates to the stakeholders of the project, then you are just an analyst! Data science is not divorced from the real world, even if in its most esoteric aspects, it may seem quite alienating to the average Joe. Data science is a business framework, among other things, and as such it constitutes an integral part of business processes. Without the latter to provide a sense of perspective and some sort of objective to the data at hand, data science is reduced to an intellectual endeavor, like modern philosophy. Even if there is value in the latter too, but it’s not what data science is about.
Context in data science manifests on various levels. At the larger scale, it’s about its relevance to the end-user and the stakeholders of the project. Because no matter how brilliant a data model is, it is wrong, as it is merely an abstraction of reality, though, if the model is crafted in a way that it provides value to the end-user, it can be useful (as George Box would put it). This value stems from the context it takes into account. Yet, context also manifests in the way the data is engineered and distilled into information. For example, there are a number of ways to do dimensionality reduction (i.e., make the number of features smaller, while in some cases making these features more compact). If you follow a recipe book blindly, you’ll probably resort to PCA, ICA, or some other off-the-shelf method. However, if you look at the problem more closely, you may employ a different strategy, particularly if you have label data at your disposal. Such additional information may impact the way the feature data is perceived and make a feature filtering approach more relevant, for example.
Perhaps it would be prudent to put data science into perspective, rather than focus on its techniques and tools only. Being mindful of the context of every part of data science pipeline is a great way to accomplish that. After all, just like every applied science, data science is geared towards people, not abstract entities to populate theories and research articles. The latter are useful, but the former are what provide our craft with meaning and business value.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.