So, recently I decided to make a video on this topic, based on some things I've observed in data science candidates. The hope is that this may help them and anyone else who may be looking into becoming a more holistic data scientist, instead of just a data science technician. The video I made is now available online on O'Reilly and although it's a bit longer than others I've made (not counting the quiz ones), it's fairly easy to follow. Enjoy!
So, the 7th quiz video I've created is finally online on O'Reilly. This is the longest one so far, spanning over 51 minutes, meaning there are lots of explanations for the various questions. It covers a bunch of topics, such as A/B testing, ANOVA, and various statistical tests. I put a lot of thought into this, much like you'd put a lot of thought into designing a data science experiment. Hopefully, you'll find it as useful and enjoyable as I did.
Note that just like other videos published on O'Reilly, you'll need to have an active account (even if it's a trial one), in order to view it in its entirety. As a bonus, you'll be able to view other videos as well as books available on that platform. Enjoy!
Although transparency is often viewed in relation to predictive analytics models, when it comes to data science there is another aspect of transparency that is also particularly important: the transparency of the data science work. This has to do mainly with how transparent the process followed is, as well as the models used and the results.
Nowadays it’s easy (perhaps too easy!) to build a predictive analytics model in complete obscurity, thanks to the wonder of deep learning. This way, you may be able to bring about a satisfactory result, without gaining a sufficient understanding of the data at play, or the quirks of the problem at hand. Of course, it's not just the data scientist to blame for this reckless behavior. Far from it. The root of the problem is managerial since we often forget that the data scientist will tend to follow the most economical course of action, to deliver a result, be it a data product or a set of insights, in the least amount of time. This is often due to the strict deadlines involved in a data science project and the all too frequent lack of understanding of the field, by the people managing the project.
There is more to a successful project than a high accuracy rate or an easily accessible model on the cloud. Oftentimes the problems tackled by data science are complex and have lots of peculiarities that deserve close attention if the problems are to be solved properly. Anyone can build a predictive analytics model nowadays, without having a good grasp of data science, thanks to all these 10-12 week boot camps that offer the most superficial knowledge humanly possible to aspiring data scientists! Yet, if our expectations of data scientists are equally shallow and we are willing to put up with opaque models and pipelines, then we reap what we sow. That's why it's important to have good communication about these matters, going beyond the basics. Mentoring can also be a priceless aid in all this.
Fixing this fundamental issue requires more than just good communication and mentoring, however. We also need to opt for a transparent approach to data science. All aspects of the pipeline need to be explainable, even if the models used are black boxes, due to the performance requirements involved. The data scientists need to be able to communicate their work and findings, while we as managers need to do the same when it comes to requirements, domain knowledge, and other factors that may play a role in the project at hand. All this may not solve every issue with today’s obscure data science pipelines, but it is a good place to start.
Perhaps if we have transparency as a key value in our data science teams, we have a better chance of deriving true insights from the data available and bringing about a more valuable result overall.
Beyond the play on words here, there is an important matter that needs to be addressed, since data science is becoming increasingly influential nowadays, in various aspects of our lives. Gone are the days when it was limited to the data science departments of certain companies; these days, the impact of data science transcends the boundaries of the organizations it serves. Take for example the data scientists working for large companies like Facebook and Google. Their work influences a large number of people, even outside the companies themselves. Perhaps the range of this impact is hard to fathom even for the managers of these data science teams, since it is often long-lasting and nearly impossible to gauge without sufficient data and the time required for it to fully manifest.
Ethics is a word that's used so much that it has lost its meaning, or maybe it was never really properly defined in the first place. Also, with the impersonal aspects of ethics being formalized in particular codes of conduct, it has lost its essence since it has been reduced to a number of do's and don'ts, a set of guidelines which can be followed unconsciously and mechanically. However, ethics is the formal aspect of morality, which is founded in the values we follow. The latter are real and oftentimes comprehensible things that we express in our actions, oftentimes consciously. Values like honesty, diligence, and efficiency don't require a Master's in philosophy in order to comprehend, while the ethics of a modern information worker can be a bit more abstract and challenging to relate to. Values are something we have, whether we talk about them or not, and it's not too difficult to figure out what these are with a little introspection. However, even though values are a personal matter, they have a concrete effect on our work and on how we relate to the world. Good managers are aware of that and pay attention to the values of the candidates for the positions they wish to fill. The resume/CV is important but it’s not the only factor at play when hiring a professional.
Perhaps it's time to pay more attention to this aspect of the craft. Knowledge and know-how are becoming more easily accessible to everyone, particularly those who are willing to pay for that, an investment that is guaranteed to pay off. That's great, particularly for those who wish to enter this field even if their education is not aligned with this subject. Still, it's equally important to balance this aptitude with the moral strength that empowers us to deliver our data science work in a way that respects other people's privacy and doesn't abuse the information involved. At some point in our careers, it is natural to come to a crossroads where we need to either do what is expected or do what is ethically right. The former is bound to be a more tempting option, at least financially, while the latter may be void of any direct benefit. Having a solid set of positive values may help us make the right choice instead of trading the long-term benefit of the many for the short-term gain of the few.
With everyone in A.I. feeling the need to have an opinion or even a stance on Artificial General Intelligence (AGI), we often neglect the source of this concept, namely the well-rounded intelligence that characterizes a human being, having all kinds of smarts. The latter I refer to as Natural General Intelligence (NGI) and someone can argue that it's as important as AGI, if not more so, at least at this point in time, particularly to data science professionals.
But isn’t this kind of intelligence another name for genius? Not necessarily. NGI is modeled after the human being in general even if its artificial counterpart (AGI) is often linked to super-intelligence, a kind of supergenius that may characterize an A.I. that has developed this level of intelligence. Still, it is possible to have NGI without being a modern Leonardo da Vinci or a Benjamin Franklin.
Natural General Intelligence is all about enabling your mind to develop in different aspects, not merely the ones that you need for your vocation or the ones that were essential for your survival so far. This idea is not new and was popular during the Renaissance. Even today we use the term "Renaissance Man" to refer to the individual who is well-rounded in his or her life and can be good at different things. In this era of overspecialization, this seems to be a Utopian endeavor, at least to some people. In reality, however, it isn't. If you want to learn a musical instrument, for example, there are plenty of courses and books you can leverage, and there are even music instructors who can teach you over the internet. As for the instruments themselves, they are far more affordable than they used to be, while for certain instruments the prices continue to drop due to high demand. However, more important than developing one’s musical aptitude is the growth of one’s emotional intelligence (EQ), particularly interpersonal skills.
What does all this have to do with data science? Well, in data science it’s easy to overspecialize too (e.g. in Machine Learning, Data Engineering, NLP, etc.). However, this creates artificial barriers which may render communication with other data professionals more challenging. Of course, more often than not these issues are alleviated through a competent data science lead or a manager with sufficient data science understanding. Still, if you as a data science professional can mitigate the need for external intervention when it comes to collaborating with others, that’s definitely a plus. Not just in terms of smoothing the professional relationships involved, but also in terms of business value. Stand-alone professionals are very sought after since such people tend to be (or quickly become) assets. In time, these professionals can grow into versatilists and/or assume leadership positions.
From all this, it is hopefully clear that Natural General Intelligence is more tangible and significantly more feasible than any other kind of advanced intelligence capable of yielding value in an organization. What's more, an individual with NGI is bound to be more relatable and accountable, rendering the whole team he/she belongs to a more functional unit. Perhaps such a goal is more beneficial than the blind pursuit of some exotic kind of A.I. that can solve all of our problems. The latter is intriguing and worth investigating, but I wouldn't bet on it benefiting the average Joe any time soon!
In the most venerable of sciences, Physics, there are two closely linked concepts, that of work and that of energy. Work is the result of a force applied over a given distance, while energy is often seen as the result of work. However, energy takes a variety of forms, which enables us to produce work through the use of it, be it through a preexisting form (e.g. uranium and thorium) or some man-made form (e.g. a battery). This fundamental idea of the relationship between work and energy, which we often take for granted, is something that applies to data science as well, by substituting value for energy.
Value is sometimes considered as the 5th V of Big Data (the other four being Volume, Velocity, Variety, and Veracity), something that is quite inaccurate though since value is a fundamental characteristic of information, not a particular kind of data. Information, however, can be found even in relatively small datasets (which were considered large once, before the era of big data), so calling it a characteristic of big data can be misleading. This misconception doesn't take away any value from the idea of value though, which is often a value instilled in many data scientists, particularly those who go beyond the techniques and methods. These data scientists penetrate the essence of the craft, through the development of the data science mindset, which is the most valuable aspect of the field.
Value is something that concerns business people too, however, since it is one of the outcomes of a data science project, which ideally can translate into increased revenue, be it via the development of a new product or by making a business process more efficient. Also, value can enable an organization to expand its scope, know its customers better (KYC), and liaise with other organizations more effectively. This value, which often takes the form of insights, is at the core and oftentimes at the end of the data science pipeline.
Value, however, can also take the form of a product, such as an API that automates a particular evaluation process or a prediction. Although the technology behind such a product is nothing spectacular (APIs have existed for a while now and they are fairly straightforward for a software engineer to develop), the data science part of that product is what brings about the real value in such an API. Without a data science engine behind it, an API is bound to be more of an ETL tool which, although still valuable, is not of the same caliber as data science-powered APIs.
Value in data science is often found in the information distilled from the data, particularly through a predictive analytics model. Elements of it, however, are already encountered in the data discovery stage of the pipeline, where the data scientist evaluates the features at hand and the metadata available. This is often conducted through the creation of data models, which is why it is part of the data modeling part of the pipeline. I talk about all this in detail in the Data Science Modeling Tutorial, available on the O'Reilly (formerly known as Safari) platform.
Value in data science is a big topic and if I were to continue, this article would be irksomely long. It would be best if I continued this in another article, or even a series of articles, in the weeks to come. Cheers!
The knowledge vs. faith conundrum has been a philosophical debate for eons, yet it usually is geared towards abstract matters, such as life after death. So, how does this apply to a pragmatic field such as data science? Well, contrary to what many people think, most data science practitioners often rely on faith to a great extent, when dealing with data science matters. But why is that?
Unfortunately, most people learning the craft have a strict timetable to keep, so they don't have a chance to go in depth on the material covered. This increasingly severe temporal limitation is also coupled with other factors, such as the plethora of "cookbooks" on the topic. These are not to be confused with actual cookbooks, which comprise various recipes, oftentimes original tried and tested dishes developed by experienced chefs; those cookbooks are fine and probably offer a bigger bang for your buck than the technical cookbooks, which are basically a bunch of methods and functions, usually in a popular programming language, organized by someone who oftentimes doesn't even understand them. If you rely mainly on such sources of knowledge, you are basically putting your faith in these people and creating gaps in your understanding of the craft.
So, if you obtain technical knowledge quickly or from a source that doesn't go much in depth, you are unlikely to truly know data science. That's not to say that you shouldn't read books; far from it. Books are useful but no matter how good they are, the best way to learn something remains the empirical approach. Going under the hood of the methods involved, implementing methods from scratch, and even experimenting with your own ideas are all good ways to learn something in more depth and remember it for longer periods of time. Also, through empirical knowledge of the craft, you are more confident about what you know and oftentimes more aware of the boundaries of your knowledge.
There is room for faith in our field, as for example when you trust what your data science lead/director tells you, when you accept advice from a mentor, and when you rely on the know-how of an academic paper written by someone who knows data science in-depth. However, it's good to balance it with empirical knowledge to the extent your time allows. Perhaps in abstract matters, it's hard to obtain empirical knowledge, but on things that you can test yourself, the only limitations are man-made ones. Are you willing to transcend them?
There are many mistakes that can be made in data science, many of which can go unnoticed for a while. The reason is that unlike coding bugs, these mistakes don't throw an error or an exception, making them harder to spot and fix as a result. In my view, the biggest such mistake is that of thinking that one aspect of data science is so significantly better than the others that the latter don't matter much. I used to think like that back in my PhD days (my thesis was on Machine Learning and heuristics) but fortunately, I discovered the error of my thinking and started broadening my perspective on this matter, something I continue to do as I learn more about this fascinating field.
Let's look into this more closely. For starters, there are several frameworks or tool-kits available in data science today, ranging from Statistics to Machine Learning, and lately, A.I. based models. All of them have their own set of advantages as well as limitations. Many Machine Learning models, for example, particularly A.I. based ones (mainly ANNs) are very hard to interpret and are often referred to as black boxes. Stats models, on the other hand, may be easy to interpret, but they may not be as accurate, while they tend to have a number of assumptions which may not always hold true. That's why claiming that one of these frameworks or tool-kits is the best one at the expense of others is a very shaky position.
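To illustrate what "easy to interpret" means for a Stats model, here is a minimal sketch of a one-feature linear regression fitted from scratch with ordinary least squares. The function name and the toy data are hypothetical, made up purely for illustration. Note that the readability of the result rests on an assumption (a linear relationship between the two variables) that may not hold in real data, which is exactly the kind of limitation mentioned above.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope). Assumes a linear relationship between x and y.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean and co-variation of x with y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variance, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data (not from any real study): hours of practice vs. test score
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

intercept, slope = fit_simple_linear_regression(hours, scores)
print(f"baseline score: {intercept:.1f}, change per extra hour: {slope:.1f}")
```

The two fitted numbers can be read directly: one is the expected score at zero hours and the other is the expected change per additional hour. A large neural network offers no comparably direct reading of its parameters, which is why the two frameworks tend to complement rather than replace each other.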
However, with all the hype around the latest and greatest Deep Learning methods (and other A.I. based models used in Data Science), it's difficult to argue against this position. Also, with Statistics having such a good reputation in academia and proven applicability across different domains, it's also hard to argue that it's not as good a framework. This may be good in a way since it keeps us humble, but it may also obstruct progress. How can you have the nerve to put forward something new if it doesn't comply with what is considered "the best" or if it doesn't comply with the traditional approaches to data learning, such as Statistical Learning?
I'm not claiming to have a solution to this conundrum, by the way, and perhaps it's not something that can be answered simply. However, riddles of this kind, which plague the data science field, can be good food for thought and bring about a sense of genuine wonder about the prospects and the future of data science. Maybe when someone asks us what the best framework of data science is, it's better to say "I don't know" and consider using different ones in tandem, instead of flocking to this or the other group of people who have made up their minds about this, and who are unlikely to ever change them. After all, open-mindedness is something that never gets old, at least not in a truly scientific field.
Being open-minded has been a key trait of any scientist since the beginning of Science. The scientific method is basically a practice that relies on open-mindedness, focusing on testing a hypothesis based on the evidence at hand. However, nowadays there is a trend towards heretical behavior (for lack of a better word) when it comes to the science of data, as well as the application of A.I. in it.
Open-mindedness is not just being open about the results of an experiment though. That’s easy. Being open to other people’s ideas and beliefs is also important. It’s easy to dismiss some people, especially those who write about this matter even though they lack the training you may have in the field. Still, those people may have some interesting insights, which they often express in their articles. You don’t have to agree with them in order to gain from this and expand your perspective. However, dismissing an article because it makes use of this or the other term (which in your opinion is not that relevant to the topic they tackle) is closed-minded.
That’s not to say that we should accept everything we read, however. Some of the material out there is of low informational value and can be biased towards this or the other technology, for various reasons. That’s normal since the field of data science (as well as A.I. to some extent) is closely linked to the business world and is influenced by the dynamics of the markets of tools and frameworks related to data analytics.
So, what do we do about all this? For starters, we can read an article before we dismiss it as irrelevant or otherwise problematic. Also, if we don’t agree about something with the author, we can construct arguments against that point and express them without attacking the other person. There are people who are incredibly toxic to the field and pose a threat to it by propagating their erroneous beliefs, but fortunately, these are few. Also, they are probably beyond salvation, since they have too large a following to ever question their beliefs. Still, by going against their propaganda, we can help the people who haven’t made up their minds yet on the topic.
Perhaps that’s why the most important thing you can learn about data science and A.I. is to have a mindset that is congruent to your development as a professional, always maintaining an open mind. Just because there are fanatics in this field who are getting paid way more than they should and maintain a large following due to their charisma, it doesn’t mean that this is the best way to go. It’s not easy to be open-minded in a place where fanaticism thrives, but in the long run, it’s a viable strategy. After all, data science is here to stay, in one form or another, while the views on it that are now popular are bound to change.
It's funny how when you think you know something, you often discover that you don't know it that well. This is particularly the case in data science, a field that holds more mystery than most people think. For example, a great many heuristics and models are based on the idea of similarity, and several metrics have been developed to gauge the latter. Many of them are based on distances but others are more original, in various ways.
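To make the idea of a similarity metric more concrete, here is a minimal sketch of two conventional, textbook ones: a distance-based similarity (Euclidean distance mapped through 1 / (1 + d), which is one common choice among several) and cosine similarity, which is not distance-based. The function names and sample vectors are hypothetical, chosen just for this illustration, and neither function is meant to represent any original metric.

```python
import math


def euclidean_similarity(a, b):
    """Distance-based similarity in (0, 1]: 1 for identical points, approaching 0 as they move apart."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1] (in [0, 1] for non-negative data)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical sample points: q points in the same direction as p but is twice as long
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

print("Euclidean-based similarity:", euclidean_similarity(p, q))
print("Cosine similarity:", cosine_similarity(p, q))
```

The two metrics disagree on this pair: the distance-based one sees the points as fairly far apart, while the cosine one sees them as practically identical because it ignores magnitude. This is one reason the choice of similarity metric, and how the data is scaled beforehand, matters so much.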
During my exploration of the hidden aspects of data science (my favorite hobby), I came across the idea of a similarity metric that is not subject to dimensionality constraints, unlike all of the distance-based ones, while also being fast and easy to calculate. Also, this is something original that I haven't encountered anywhere else and I've looked around quite a bit, especially when I was writing the book "Data Science Mindset, Methodologies and Misconceptions" where I talk about similarity metrics briefly.
Anyway, I cannot explain it in detail here because this metric makes use of operators and heuristics that are themselves original, part of my new frameworks of data analytics. Let's just say that it makes use of Math in a way that seems familiar and comprehensible, but has not been used before. Also, it yields values in [0, 1], with 1 being completely similar and 0 being completely dissimilar. The idea is to find a way to gauge similarity from different perspectives and combine the result, something that would unfortunately only work if the data is properly normalized. Given that all conventional ways of normalizing data are inherently flawed, this metric is bound to work only in properly normalized data spaces. Because such spaces are more or less balanced (even if they have outliers), the average similarity of all the data points in them is always around 0.5 (neutral similarity), something that makes the metric very easy to interpret.
As with other metrics and heuristics, it's not they themselves that are the most important thing, but the doors they open, revealing new possibilities (e.g. a new kind of discernibility metric). That's why I found the picture of the fractal above quite relevant since it is all about self-similarity, a concept that led us to the discovery of a new kind of Mathematics related to Chaos. Interestingly, even with such advanced knowledge, we are unable to fully comprehend the chaos that reigns in modern A.I. systems, something that has its own set of problems. So, I ask you to wonder for a moment how much better A.I. would be if it were developed using comprehensible heuristics, making it transparent and interpretable. Perhaps its thinking patterns wouldn't be as dissimilar to ours and we wouldn't see it as much of a threat.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.