There is a lot of unstructured data out there. Many people view it as untapped potential, and they are right: there are a lot of signals in it, waiting to be harnessed by the data scientists who get to them. However, most of the data where these signals dwell is unstructured or semi-structured (it has some structure, but not a consistent one). This leads some people to believe that structuring the data will instantly make it more valuable. That view is debatable, and it is worth exploring further before it brings about unrealistic expectations of what data science can do.
Structuring data is part of the data science process. Before we can feed the data to a model, we need to get it into the form of a matrix (if all the data is of the same type) or a data frame (if the dataset contains various types). However, the fact that structuring is necessary for mining the information in the data (usually in the form of insights) does not make it sufficient. In other words, we have to structure the data, but doing so doesn't guarantee anything. There have been many times when, after training various models from different frameworks, things don't pan out: the performance is mediocre, the results are not actionable, and the whole project is labeled a failure of sorts. I do not mean to discourage anyone, but it's healthy to be aware of this possibility, since it's not often shown in data science books or tutorials. People like to talk about the success stories, which leads to a false understanding and unrealistic expectations.
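To make the "necessary but not sufficient" point concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic table, the random labels, and the nearest-centroid classifier are hypothetical stand-ins, not something taken from a real project. The data is perfectly structured, yet the labels have nothing to do with the features, so the model should do no better than a coin flip.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_table(n_rows, n_cols):
    """Build a tidy numeric table whose labels are unrelated to its features."""
    X = [[random.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [random.randint(0, 1) for _ in range(n_rows)]  # pure noise labels
    return X, y

def centroid(rows):
    """Column-wise mean of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

X, y = make_table(2000, 5)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

# "Train": one centroid per class, computed on the training half only.
centroids = {
    label: centroid([x for x, t in zip(X_train, y_train) if t == label])
    for label in (0, 1)
}

def predict(x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: sq_dist(x, centroids[label]))

accuracy = sum(predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)
print(f"accuracy on held-out rows: {accuracy:.2f}")  # expected to hover around chance
```

The table here is as clean as a table can be, and swapping in a fancier model would not help, because there is no signal to find.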
For the data to be valuable, it needs to have a strong signal in it. This means that even by just looking at it, you can tell that there is something there that, given enough time and effort, you would be able to find yourself. Data science facilitates the process of mining that signal, since no one has the patience or the resources to go through a data stream on their own, no matter how motivated they are. In such a case, data science is bound to be successful, since it accelerates the process of turning this information-rich data into actual information, or even knowledge. The structure of the data, however, is not so relevant here. Even if the data is in JSON or raw text format, for example, it can still be useful, since it’s not too difficult to generate features that penetrate this nebulous form and encapsulate its essence in a form that can easily fit into a database table (albeit usually a very large one).
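As a rough illustration of what generating features from JSON or raw text can look like, here is a minimal sketch using only the Python standard library. The record layout (`user`, `text`, `tags`) and the particular features are hypothetical choices made for this example; real data would call for features chosen after exploring it.

```python
import json
import re

def extract_features(record: dict) -> dict:
    """Turn one semi-structured record into a flat row of features.

    Missing fields are tolerated, since semi-structured data is not consistent.
    """
    text = record.get("text", "")
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "user": record.get("user", {}).get("name"),
        "n_tags": len(record.get("tags", [])),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "has_question": "?" in text,
    }

# Two made-up JSON lines; the second one has no "tags" field at all.
raw_lines = [
    '{"user": {"name": "ann"}, "text": "Is the sensor down? It looks down.", "tags": ["ops", "alert"]}',
    '{"user": {"name": "bob"}, "text": "All good here."}',
]

# Each row has the same columns, so the result fits a database table or data frame.
rows = [extract_features(json.loads(line)) for line in raw_lines]
for row in rows:
    print(row)
```

Every row ends up with the same keys regardless of what the original record contained, which is the property a table needs. Whether those columns carry any signal is a separate question.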
So, it is important to exercise discernment in this matter. Structured data may well be more appealing to a data scientist, as it means less tedious work for her, but it doesn't guarantee anything of value. Besides, the process of structuring the data (aka data engineering) can be insightful too, as it involves some data exploration. Data exploration may not always accelerate the structuring of the data, but it definitely helps you understand the data better and make more informed choices about the whole data science process (including structuring). After all, shortcuts in the process may save you some time, but if you know what you are doing, you can do without them and save your organization some money as well, since automated data structuring is not free. The choice is yours.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.