Bugs are terrible and high-level mistakes are even worse! Yet, most data science books out there don't say much about them, or how we can deal with them when they arise in our data science work. Reading these books may give someone the impression that everything in the data science world is smooth and filled with rainbows, something that is (sadly) far from the truth! So, instead of being in denial about this very important matter, we can choose to tackle it calmly and intelligently. This is why I made this video, which is now available on Safari Books Online for everyone interested in having a better and more bug-free data science life. Enjoy!
Why is it important to ask questions in data science? How can you answer these questions? Where do hypotheses fit in? How does all that relate to the know-how you have? So many questions! For some answers to them, feel free to check out my latest video on Safari Books Online. As always, your feedback is welcome...
This post is inspired by Joel Grus’s latest blog post about an interview of his, in which he showcases his TensorFlow know-how. Now, this interview was probably imaginary since I doubt anyone would be that foolish in an interview, but he is not afraid to make fun of himself to get a point across, something that is evident in most of his writings (including his book, Data Science from Scratch). I have no intention to promote him, but I find that his whole approach to data science is very fox-like, so it is only natural that he is mentioned in my blog!
In this dialectic post of his, Joel describes an interviewee’s efforts to come across as knowledgeable and technically adept, as he tries to solve the Fizz Buzz problem he is asked to whiteboard, using this Deep Learning package, along with Python. Of course, why someone would waste time asking him to solve such a simple problem is incomprehensible to me, but perhaps it’s for an entry level position or something. Still, if this were a data science position, the Fizz Buzz problem would be highly inappropriate as it has nothing to do with programming that’s relevant to data science. Joel goes on in his blog post to describe the great lengths he has to go to in order to get a basic neural network trained and deployed so that he can solve the problem. Even though he does nothing wrong (technically), his approach fails to yield the desired output and he fails the interview. That’s not to say that he or his tools are bad, but it clearly illustrates the point he’s trying to make: advanced techniques don’t make one a good data scientist!
This is an issue with many data scientists today who have gotten intoxicated with the latest and greatest A.I. tech that’s found its way into data science. The tech itself is great and the tools it has been implemented with are also great. However, just because you can use them, it doesn’t make you a good data scientist. So what gives? Well, even though Deep Learning is a great framework for tackling tough data science problems, it fails miserably in the simpler ones, which are also quite common. Perhaps it’s the lack of data points, the fact that it takes a while to configure properly, or some other reason that depends on the problem at hand. Whatever the case, as data scientists we ought to be pragmatic and hands-on. Just because we know an advanced Machine Learning technique, it doesn’t mean that we should use it to solve all of the problems we are asked to solve. Sometimes we just need to come up with some simple heuristic and work with that.
There is an old saying that illustrates the issue that Joel describes in that post: killing a mosquito with a cannon. Yes, you may actually succeed in killing the poor insect with your fancy artillery weapon, but is that really cost-effective? Nowadays many data scientists go with the Deep Learning option because someone convinced them that it’s the best option out there in general, without sitting down for a minute and thinking about whether it’s the best option for the particular problem they are facing. Data science is not as simple and straightforward an approach to problem-solving as some people make it out to be. So let’s get real for a minute and tackle problems like engineers, opting for a simple solution that works, before calling the cavalry for A.I. to help us. Being super adept may be appealing, but we first need to be adept at what we do by employing a down-to-earth approach that just works, before opting for improvements through more advanced models.
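For reference, here is the kind of simple solution that works for this particular problem. It is a minimal sketch, assuming the usual statement of Fizz Buzz (go through the numbers 1 to 100, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both with "FizzBuzz"); the function name fizz_buzz is just an illustrative choice and not something taken from Joel's post.

```python
def fizz_buzz(n: int) -> str:
    """Return the Fizz Buzz label for a positive integer n."""
    # Check the combined case first, since multiples of 15
    # are also multiples of 3 and of 5.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    # Assumed range of 1 to 100, as in the common statement of the problem.
    for i in range(1, 101):
        print(fizz_buzz(i))
```

No training data, no architecture to configure, and no chance of the output being almost right: a handful of modulo checks does the job.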
Intuition is probably the most undervalued quality in data science, even though it has played a prominent role in science throughout the years. Even in mathematics, intuition is very important, since it illuminates avenues of research or novel ways of tackling a particular problem. However, even though intuition is held in high regard in most scientific fields today, in data science it is not valued much, especially lately, when the emphasis is on the engineering and modeling aspects of the field.
Data science involves a lot of nitty-gritty work, which is why Kaggle competitions are a bit misleading when it comes to introducing the field to newcomers. Despite their practical value, they emphasize one particular aspect of data science (the most interesting one), creating the belief that it’s all about clever feature engineering and models. So, when someone goes deeper into the field they tend to shift to the other extreme and focus on the data engineering aspects of it, which constitute well over 80% of the actual work a data scientist does. Preparing the data, formatting it, playing around with the variables and turning them into features are all parts of the data engineering stage of the pipeline, and they require more grit than intuition or even intelligence. That’s all fine, but many people forget to get back to the bigger picture afterwards: what data science is all about. If you are thinking “insights” at this point, you are on the right track. However, to bridge the data to these insights, we need some intuition.
We need intuition to figure out the most information-rich features and build them. Without intuition, we wouldn’t be able to figure out what models would be best to try out (contrary to what many people think, there are A LOT of models we can use, not just the more popular ones that appear in textbooks and data science MOOCs). Also, if we are to employ deep learning, which is a great way to tackle the most challenging problems out there, especially if we have a truckload of data at our disposal, then we need intuition there too, in order to figure out what architecture to employ and how to best leverage the meta-features that these deep ANNs will construct after they are trained. Things are not plug-and-play as some people tend to evangelize, especially when it comes to these modern tools. The need for some broader perspective and strategic thinking, both of which stem from intuition, is evident in all data science projects.
How do we develop intuition? Well, that’s the million-dollar question. In my experience, it stems from intelligence as well as lateral thinking. When we think about things more openly, much like an artist does, we tend to make more use of the part of our mind that is related to intuition. If you manage to come up with a fairly original way of dealing with the data (even if someone else somewhere has come up with it too), if you figure out some clever heuristic that will cut down the computational cost of your process, and if you build a novel ensemble to harness the signals from various models, then you are using your intuition constructively and you are thinking like a data science creative.
Intuition is closely related to creativity, which is why it is often the case that people who build data science teams look out for this characteristic in their recruits, especially if it is for a more senior position. However, for some reason they don’t use that word much (intuition) since it has some undesirable connotations. Oftentimes, intuition is considered to be in the domain of pseudo-science, since its fruits fail to be understood by the more down-to-earth practitioners of data science. Nevertheless, intuition has been used successfully by many inventors, debunking the claim that it is the domain of crackpots. The problem is that it is very hard for most people to assess intuition in an individual, which is why it is often neglected in more hands-on fields. However, if you have used your intuition in a project and have come up with a creative approach to it, that is not only original but also apparent to someone who views your work, then that’s a sign that these people cannot ignore.
So, even if intuition is not so fashionable today, when fancy A.I. tech is all the rage, it still has a place in data science. Just like the fabled warrior-magicians in the Star Wars saga manage to combine the mastery of hands-on techniques with an intuitive approach to life (through the Force), so can we, as data scientists, combine technical skill with intuition, to tackle the challenges of big data problems and derive actionable insight from the chaotic data we are given.
We hear a lot about deep learning (DL) lately, mainly through social media. All kinds of professionals, especially those involved in data science, never get tired of praising it, with claims ranging from “it’s greatly enhancing the way we perform predictive analytics” to “it’s the next best thing since sliced bread or baked bread for that matter!” What few people tell us is that most of these guys (they are mainly male) have vested interests in DL, so we may want to take these claims with a pinch of salt!
Don’t get me wrong though; I do value DL and other A.I. methods for machine learning (ML). However, we need to be able to distinguish between the marketing spiel and the facts. The former is for people poised to promote DL at all costs (for their own interests), while the latter is for engineers and other down-to-earth people who prefer to form their own opinions on the matter, rather than get all infatuated with this tech like some mindless technically inept fanboy.
Deep Learning involves the training and application of large ANNs to predictive analytics problems. It requires a lot of data and it promises to provide a more robust generalization based on that data, definitely better than the already obsolete statistical models, whose performance in most big data problems leaves a lot to be desired. Still, it is not clear whether DL can tackle all kinds of problems. For example, it is quite challenging to acquire the amount of data that is needed in order to solve fraud detection or other anomaly detection problems. When it comes to classifying images, however, the data available is more than adequate to train a DL network and let it do its magic. In addition, if we are interested in finding out why data point X is predicted to be of value Y (i.e. which features of X contribute the most to this prediction), we may find that DL isn’t that helpful because of the black box problem that it inherently has, just like all other ANN-based models. If, however, all we care about is getting this prediction and getting it fast, a DL network is sufficient, especially if we train it offline before we deploy it on the cloud (or on a physical computer cluster, if you are more old-fashioned).
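To make the "why is X predicted to be Y" question more concrete, here is a rough sketch of what a transparent model offers by contrast. It assumes a plain linear model, where each feature's contribution to a prediction is simply its weight times its value. The weights, the bias, and the feature names (amount, age_days, n_items) are all made up for illustration; none of them come from a real dataset or from any particular library.

```python
def feature_contributions(weights, bias, x):
    """Split a linear model's prediction into per-feature contributions.

    weights and x are dicts keyed by feature name; bias is a float.
    Returns the prediction and a dict of contributions (weight * value).
    """
    contributions = {name: weights[name] * x[name] for name in weights}
    prediction = bias + sum(contributions.values())
    return prediction, contributions


# Hypothetical model parameters and data point, chosen only for illustration.
weights = {"amount": 0.5, "age_days": -0.25, "n_items": 2.0}
bias = 1.0
x = {"amount": 6.0, "age_days": 8.0, "n_items": 0.5}

prediction, contributions = feature_contributions(weights, bias, x)

# Rank the features by the absolute size of their contribution.
ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)

if __name__ == "__main__":
    print("prediction:", prediction)
    for name in ranked:
        print(name, contributions[name])
```

With a deep ANN there is no such direct breakdown, since the prediction passes through many layers of non-linear transformations. That is the black box problem in a nutshell.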
Deep Learning can be of benefit to data science as it is a powerful tool. However, it’s not the tool that is going to make all other tools obsolete. As long as there are other parts in the pipeline beyond the data engineering and data modeling ones (e.g. data visualization, communicating the results, understanding the business questions, formulating hypotheses, among others), getting a DL system to replace data scientists is a viable option only in sci-fi movies. People who fantasize about the potential of DL in data science, imagining it to be the panacea that will enable companies to replace data scientists, probably don’t understand how data science works and/or how the business world works. For example, someone has to be held accountable for the predictions involved and that person will have to explain them, in comprehensible terms, to both her manager and the other stakeholders of the data science project. Clearly, no matter how sophisticated DL systems are, they are unable to undertake these tasks. As for hiring some technically brilliant idiot to operate these systems and be a make-believe data scientist, with the salary of an average IT professional, well that’s definitely an option, but not one that any sane person would be likely to recommend to an organization, given that she wants to keep that organization as a client. If such a decision is to be made, it is most likely going to come from some person who cares more about pleasing his supervisor by telling her what she wants to hear, than about saying something that is bound to stand the test of time.
All in all, DL is a great tool, but we need to be realistic about its benefits. Just like any other innovative technology, it has a lot of potential, but it’s not going to solve all our problems and it’s definitely not going to replace data scientists in the foreseeable future. It can make existing data scientists more productive though, especially if they are familiar with A.I. and have some experience with using ANNs in predictive analytics. If we keep all that in mind and manage our expectations accordingly, we are bound to benefit from this promising technology and use it in tandem with other ML methods, making data science not only more efficient but also richer and even more interesting than it already is.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.