Many mistakes can be made in data science, and plenty of them can go unnoticed for a while. The reason is that, unlike coding bugs, these mistakes don't throw an error or an exception, which makes them harder to spot and fix. In my view, the biggest such mistake is thinking that one aspect of data science is so much better than the others that the latter don't matter much. I used to think like that back in my PhD days (my thesis was on Machine Learning and heuristics) but fortunately, I discovered the error in my thinking and started broadening my perspective on this matter, something I continue to do as I learn more about this fascinating field.
Let's look into this more closely. For starters, there are several frameworks or tool-kits available in data science today, ranging from Statistics to Machine Learning and, lately, A.I.-based models. All of them have their own set of advantages as well as limitations. Many Machine Learning models, for example, particularly A.I.-based ones (mainly ANNs), are very hard to interpret and are often referred to as black boxes. Statistical models, on the other hand, may be easy to interpret, but they may not be as accurate, and they tend to rely on a number of assumptions that may not always hold true. That's why claiming that one of these frameworks or tool-kits is the best one at the expense of the others is a very shaky position.
However, with all the hype around the latest and greatest Deep Learning methods (and other A.I.-based models used in Data Science), it's difficult to argue against the view that these are the best option. Also, with Statistics having such a good reputation in academia and proven applicability across different domains, it's equally hard to argue that it's not as good a framework. This may be good in a way since it keeps us humble, but it may also obstruct progress. How can you have the nerve to put forward something new if it doesn't comply with what is considered "the best" or with the traditional approaches to learning from data, such as Statistical Learning?
I'm not claiming to have a solution to this conundrum, by the way, and perhaps it's not something that can be answered simply. However, riddles of this kind, which plague the data science field, can be good food for thought and bring about a sense of genuine wonder about the prospects and the future of data science. Maybe when someone asks us what the best framework of data science is, it's better to say "I don't know" and consider using different ones in tandem, instead of flocking to one group or another of people who have made up their minds about this and who are unlikely to ever change them. After all, open-mindedness is something that never gets old, at least not in a truly scientific field.
Being open-minded has been a key trait of any scientist since the beginning of Science. The scientific method is basically a practice that relies on open-mindedness, focusing on testing a hypothesis against the evidence at hand. However, nowadays there is a trend towards heretic behavior (for lack of a better word) when it comes to the science of data, as well as the application of A.I. in it.
Open-mindedness is not just being open about the results of an experiment, though. That’s easy. Being open to other people’s ideas and beliefs is also important. It’s easy to dismiss some people, especially those who write about this matter without the training you may have in the field. Still, those people may have some interesting insights, which they often express in their articles. You don’t have to agree with them in order to gain from this and expand your perspective. However, dismissing an article because it makes use of one term or another (which in your opinion is not that relevant to the topic it tackles) is closed-minded.
That’s not to say that we should accept everything we read, however. Some of the material out there is of low informational value and can be biased towards this or the other technology, for various reasons. That’s normal since the field of data science (as well as A.I. to some extent) is closely linked to the business world and is influenced by the dynamics of the markets of tools and frameworks related to data analytics.
So, what do we do about all this? For starters, we can read an article before we dismiss it as irrelevant or otherwise problematic. Also, if we don’t agree with the author about something, we can construct arguments against that point and express them without attacking the other person. There are people who are incredibly toxic and pose a threat to the field by propagating their erroneous beliefs, but fortunately, these are few. Also, they are probably beyond salvation, since they have too large a following to ever question their beliefs. Still, by going against their propaganda, we can help the people who haven’t made up their minds yet on the topic.
Perhaps that’s why the most important thing you can learn about data science and A.I. is to have a mindset that is congruent with your development as a professional, always maintaining an open mind. Just because there are fanatics in this field who are getting paid way more than they should and maintain a large following due to their charisma, it doesn’t mean that theirs is the best way to go. It’s not easy to be open-minded in a place where fanaticism thrives, but in the long run, it’s a viable strategy. After all, data science is here to stay, in one form or another, while the views on it that are now popular are bound to change.
It's quite enjoyable and insightful to learn about A.I., particularly NLP, as well as other data science-related topics. However, with so vast a knowledge-base related to this topic, it's equally easy to forget a lot of what you learn. Enter this quiz, which can help you recap the most important points of Natural Language Processing in a fun way, without the stress that often accompanies quizzes in educational institutes, for example. This is an experimental kind of video, so it may still need some refinement, something I'm going to look into in the near future. In the meantime, you can check out this quiz video on NLP here. Enjoy!
Note that this video is published on the Safari platform, O'Reilly's online library of sorts, where various publishers can exhibit their creations in digital format. In order to take full advantage of this, however, you'll need to have an account since it's a subscription-based site.
It's funny how, when you think you know something, you often discover that you don't know it that well. This is particularly the case in data science, a field that holds more mystery than most people think. For example, a great many heuristics and models are based on the idea of similarity, and several metrics have been developed to gauge the latter. Many of them are based on distances, but others are more original in various ways.
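To make the distance-based case concrete, here is a minimal sketch of two conventional similarity metrics: one derived from Euclidean distance and one based on the cosine of the angle between two vectors. The function names and the 1 / (1 + d) mapping are illustrative choices of mine for this sketch, not something prescribed by any particular framework.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_similarity(a, b):
    """Map a distance to (0, 1]: 1 for identical points, shrinking towards 0 as they move apart."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; it ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]  # same direction as p, twice the magnitude

print(distance_similarity(p, p))  # identical points
print(distance_similarity(p, q))  # penalized for being far apart
print(cosine_similarity(p, q))    # direction only, so p and q look alike
```

Note how the two metrics disagree about p and q: the distance-based one sees them as fairly dissimilar, while the cosine one sees them as practically the same. Which view is "right" depends on the problem at hand.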
During my exploration of the hidden aspects of data science (my favorite hobby), I came across the idea of a similarity metric that is not subject to dimensionality constraints, as all of the distance-based ones are, while also being fast and easy to calculate. Also, this is something original that I haven't encountered anywhere else, and I've looked around quite a bit, especially when I was writing the book "Data Science Mindset, Methodologies and Misconceptions", where I talk about similarity metrics briefly.
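The dimensionality constraint of distance-based metrics can be illustrated with a small experiment. The sketch below, which assumes points drawn uniformly at random from the unit hypercube (my assumption, chosen for simplicity), measures how different the nearest and farthest neighbors of a query point are, relative to the nearest one. The function name relative_contrast and its parameters are hypothetical, made up for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random
    query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the number of dimensions grows, the contrast shrinks, meaning that all points start to look roughly equidistant from the query. A similarity metric built on such distances loses much of its discriminating power in high-dimensional data.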
Anyway, I cannot explain it in detail here because this metric makes use of operators and heuristics that are themselves original, part of my new frameworks of data analytics. Let's just say that it makes use of Math in a way that seems familiar and comprehensible, but has not been used before. Also, it yields values in [0, 1], with 1 meaning completely similar and 0 completely dissimilar. The idea is to gauge similarity from different perspectives and combine the results, something that unfortunately only works if the data is properly normalized. Given that all conventional ways of normalizing data are inherently flawed, this metric is bound to work only in properly normalized data spaces. Because such spaces are more or less balanced (even if they have outliers), the average similarity of all the data points in them is always around 0.5 (neutral similarity), something that makes the metric very easy to interpret.
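The text above does not spell out which flaws of conventional normalization it has in mind. One commonly cited weakness, sketched below as an assumption about the kind of issue involved, is the sensitivity of min-max scaling to outliers. This illustrates conventional scaling only; it is not the metric or the normalization approach described above, and the function name min_max_scale is a hypothetical one used for this example.

```python
def min_max_scale(values):
    """Conventional min-max normalization of a feature to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


clean = [10.0, 12.0, 14.0, 16.0, 18.0]
with_outlier = clean + [1000.0]  # the same feature plus one extreme value

print(min_max_scale(clean))
print([round(v, 4) for v in min_max_scale(with_outlier)])
```

Without the outlier, the values spread evenly over [0, 1]. With it, the five original values are squeezed into a tiny interval near 0, so the normalized space is anything but balanced.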
As with other metrics and heuristics, it's not they themselves that are the most important thing, but the doors they open, revealing new possibilities (e.g. a new kind of discernibility metric). That's why I found the picture of the fractal above quite relevant, since it is all about self-similarity, a concept that led us to the discovery of a new kind of Mathematics related to Chaos. Interestingly, even with such advanced knowledge, we are unable to fully comprehend the chaos that reigns in modern A.I. systems, which brings its own set of problems. So, I ask you to wonder for a moment how much better A.I. would be if it were developed using comprehensible heuristics, making it transparent and interpretable. Perhaps its thinking patterns wouldn't be as dissimilar to ours and we wouldn't see it as much of a threat.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.