There is a lot of unstructured data out there. Many people view it as untapped potential, and they are right. There are a lot of signals out there, waiting to be harnessed by the data scientists who get to them. However, most of the data where these signals dwell is unstructured or semi-structured (it has some structure, but that structure is not consistent). This leads some people to believe that structuring the data will instantly make it more valuable. This view is debatable, however, and worth exploring further before it brings about unrealistic expectations of what data science can do.
Structuring data is part of the data science process. Before we can feed it to a model, we need to get the data into the form of a matrix (if all the data is of the same type) or a data frame (whenever we have various types in the dataset). However, the fact that structuring data is necessary for mining the information in it (usually in the form of insights) does not make it a sufficient condition for that. In other words, we have to structure the data, but this doesn’t guarantee anything. There have been many times when, after training various models from different frameworks, things just didn’t pan out. The performance is mediocre, the results are not actionable, and the whole thing is labeled as a failure of sorts. I do not mean to discourage anyone, but it’s healthy to be aware of this possibility, since it’s not often shown in data science books or tutorials. People like to talk about the success stories, which leads to a false understanding and unrealistic expectations.
For the data to be valuable, it needs to have a strong signal in it. This means that even by just looking at it, you can tell that there is something there that, given enough time and effort, you would be able to find yourself. Data science facilitates the process of mining that signal, since no one has the patience or the resources to go through a data stream on their own, no matter how motivated they are. In such a case, data science is bound to be successful, since it accelerates the process of turning this information-rich data into actual information, or even knowledge. However, the structure of the data is not so relevant here. Even if the data is in a JSON or raw text format, for example, it can still be useful, since it’s not too difficult to generate features that penetrate this nebulous form and encapsulate its essence in a form that can easily fit into a database table (albeit usually a very large one).
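To make the point about generating features from JSON a bit more concrete, here is a minimal sketch in Python. The records, the field names (`user`, `text`, `tags`), and the choice of features are all made up for illustration; a real project would pick features based on the problem at hand. The idea is simply that each semi-structured record becomes one flat row that could sit in a database table.

```python
import json

# Hypothetical semi-structured records: the fields are not consistent
# across records (the second one has no "country", for instance).
raw_records = [
    '{"user": {"id": 1, "country": "GR"}, '
    '"text": "Great product, fast delivery", "tags": ["review", "positive"]}',
    '{"user": {"id": 2}, "text": "Late", "tags": []}',
]

def extract_features(raw):
    """Turn one JSON record into a flat row of simple features."""
    record = json.loads(raw)
    user = record.get("user", {})
    text = record.get("text", "")
    tags = record.get("tags", [])
    return {
        "user_id": user.get("id"),
        "country": user.get("country"),  # None when the field is missing
        "n_words": len(text.split()),    # crude whitespace-based word count
        "n_chars": len(text),
        "n_tags": len(tags),
    }

# Every row has the same keys, so the result is table-like.
rows = [extract_features(r) for r in raw_records]
for row in rows:
    print(row)
```

Note that a tidy table like this says nothing about whether the features carry any signal; that still depends on the data itself.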
So, it is important to exercise discernment in this matter. Sure, structured data may be more appealing to a data scientist, as it means less tedious work for her, but it doesn’t guarantee anything of value. Besides, the process of structuring the data (aka data engineering) can be insightful too, as it involves some data exploration. Data exploration may not always accelerate the structuring of the data, but it definitely helps you understand the data better and make more informed choices about the whole data science process (including structuring). After all, shortcuts in the process may save you some time, but if you know what you are doing, you can definitely do without them, saving your organization some money in the process, since automated data structuring is not free. The choice is yours.
Many people argue that data science’s main purpose, particularly in a business setting, is to mine and deliver insights. Unlike data products (another type of data science deliverable), insights are fairly straightforward and require little software development (something often outsourced to the dev team). However, their value is a subject of debate, since few insights are actually used in practice, in real-world projects.
An insight is generally some non-trivial conclusion that stems from rigorous analysis of a data stream, be it with A.I. techniques (e.g. a deep learning network), some other machine learning methodology (e.g. an unsupervised learning system), or even some statistical process (e.g. a chi-square test). By definition, it is not something that you can pinpoint by just plotting the data, or calculating some superficial metric, like the mean, or standard deviation (which are fine by themselves, but insufficient for generating useful insights).
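As a small illustration of the statistical end of that spectrum, here is a sketch of a chi-square test of independence in plain Python. The counts in `observed` are invented for the example, the helper names are my own, and the p-value shortcut shown only holds for one degree of freedom (i.e. a 2x2 table); no continuity correction is applied. In practice you would probably use a statistics library instead.

```python
import math

def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical 2x2 table, e.g. rows = customer segment, columns = churned or not
observed = [[30, 10],
            [20, 40]]

stat = chi_square_statistic(observed)
p = p_value_df1(stat)
print(f"chi-square = {stat:.3f}, p-value = {p:.6f}")
```

A small p-value here suggests the two variables are related, which is the kind of conclusion that a mean or a standard deviation alone would not reveal.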
It would be good to differentiate between the various aspects of the value of an insight. First of all, there is the inherent value of the insight. This is in essence a signal in the data analyzed, or some interpretation of it. This kind of value is useful primarily, and in a hands-on way, for the data scientist and other people involved in the project. If the data science project is related to research, this kind of insight can be the basis of a publication. However, an insight that has merely inherent value is often not enough.
Another aspect of the value of an insight is its commercial application. This is significantly more important for the majority of data science projects. The reason is that someone is paying for the project, and it’s this kind of insight that eventually brings about a positive ROI for it. The data scientist may not necessarily value the commercial aspect of the insights he delivers, but the project manager definitely does, as do the other stakeholders of the project.
Finally, there is the practical value of the insight. Whether the insight has commercial value or not, it may enable the development of something tangible, like a data product, or some in-depth understanding of the problem at hand. This kind of value is conducive to a new cycle in the data science process, something that is bound to bring about new insights, yielding additional value.
Whatever the value of the insights, it is important to remember that one’s work shouldn’t be judged entirely by them. Surely it’s great if you can produce something actionable, or something that sheds light on the problem investigated, but if the data streams available are as noisy as the screen of a TV that’s not tuned to a network, then there is not much you can do with them. After all, the rule that many software developers have, “garbage in, garbage out” (GIGO), is applicable to data science as well. If you want valuable insights, you need data streams that have some useful signal(s) in them; otherwise you are just wasting your time.
What are your insights on this matter?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.