After talking with my publisher, I got him to offer a 25% discount on my latest book to the people I connect with through this blog and through beBee. This discount applies to both the printed version and the PDF. So, if you found the price too steep, here is a promo code you can apply to get 25% off, for a limited time only: Zack25
A public domain photo refactored as a piece of digital art by a deep learning network, on my laptop
Painting is not my favorite art. Nevertheless, I do enjoy it more than most people (apart from people who actually practice the art perhaps), since it’s easy on the eyes and the meaning it tries to convey is far easier to grasp than any other art. Creating something in this art form is very time-consuming though, which is why I admire those who have the patience to make something beautiful out of their canvases and their paints. Also, it takes a special kind of intelligence to be able to create in this domain. Could it be that artificial intelligence can emulate that? The answer is yes!
Over the years, machines have been used in a variety of creative tasks, particularly music. This is obvious to those who have delved into this art, but I don’t want to go off on a tangent here. Doing something creative with A.I. in the painting domain is a whole different kind of challenge though, especially if you don’t know much about the art, like most A.I. people. Of course everyone can do some rudimentary kind of drawing, but does that qualify as art? I doubt it, and I’m sure anyone who has immersed themselves in the fascinating history of art would agree. There is something else when it comes to making a painting, something that has eluded A.I. algorithms… up until now.
So, what is A.I.-based painting? Well, it is digital for starters. It’s not like the A.I. picks up a palette and a brush and starts coloring a canvas (although I wouldn’t be surprised if there were robots out there equipped with such an A.I. doing just that). Most A.I. systems that can paint do so with a deep learning system that has been trained in a particular style of painting. As an input, such an A.I. system usually takes a digital image, which is the equivalent of the idea or subject that usually fuels such creative endeavors in human beings. What the A.I. does after that is create a new image that makes use of the primary features of the original image (the quintessence of the subject, if you will). These features, which correspond to a particular color palette, shapes, locations of these shapes, and other relevant information, are then processed by the deep learning network the system employs. The output of that network is then mapped into a form akin to the original image or corresponding to a set of specifications regarding the resolution of the art piece. The output, naturally, is an image of that resolution. Of course, the A.I. doesn’t have a clue about what it is doing, but given enough training data in its deep learning network, it can perform the task quite creatively.
Although most such synthetic artistic products are very interesting, not all of them are particularly pleasing or even worth the wait of the whole creation process (which is non-trivial when undertaken by a single computer, even if the deep learning networks are already trained beforehand). So, if you are an artist committed to this particular art form, you shouldn’t worry about your work becoming outsourced to machines any time soon! Whatever the case, applications like this are by far more meaningful than other, less thoughtful uses of computational resources for A.I. purposes. This, however, is probably the topic of a future post on the subject…
Julia has been a topic of controversy in the previous year, the year that was critical for the language’s future, at least in the data science domain. In the beginning of that year, while working at a small-medium company as a data science contractor, I remember making the argument that Julia is ready for data science and that we should give it a shot. Both the people of that company and the people of a vendor company (a local data science start-up that was acquired by Apple later that year) were very skeptical about this. Claims that “Julia is not data science ready” which floated all over the web seemed to echo in our conversations as well.
Later that year I focused on my book on the language and its applications in data science, a book I had started writing the previous Fall. At that point no-one else seemed to care about Julia in the data science community, and the big players in the corporate world that had a say about data science (e.g. Amazon, Microsoft, etc.) didn’t seem to even take notice of this promising technology. Still, I knew that the merits of this language would one day surface in people’s minds as well as on the web. So, I finished the book, got it published, and gave a couple of talks on the language. Even though it was the first book ever to be written on this topic (focusing on the data science applications of Julia), it was soon followed by another one from another publisher, bearing the same title! Also, a few days before I gave my first talks on the subject, Julia entered the top 50 languages in the TIOBE index for that month (blog article from Julia Computing). Clearly the claim that Julia was not data science ready had started to seem like an opinion of the less informed people.
It was that Fall, about a year after I’d started working feverishly on my Julia book, that Amazon took a very bold step, which I consider to be the tipping point. That Fall, Julia started to rise in the eyes of the corporate world, as Amazon adopted the MXNet deep learning framework, which included Julia as one of the languages that it supported (MXNet article on my blog). The researchers involved in this project even published a scientific article about this, in collaboration with the University of Washington, a very prestigious academic institution that was one of the first ones to popularize data science education through its corresponding programs.
After that point, Julia was officially a fully cloud-supported technology. Microsoft soon joined the game by adopting it in the Azure framework (blog article by a Julia user in Denmark). Even Google decided to support Julia in its TensorFlow deep learning system, which up until then was Python exclusive. It seems that the use of Julia in data science is not a fad after all!
Yet, there are still people claiming that Julia is not a data science language and that language X is the way to go because most people have been using X in the past few years. Perhaps they are right, at least subjectively. Some companies are so conservative that they will probably die before admitting that the technology they are using is not the best out there. However, instead of paying attention to them, you can do your own research on the topic and form your own view on the matter. That’s what I did and I never regretted it!
Bugs are terrible and high-level mistakes are even worse! Yet, most data science books out there don't say much about them, or how we can deal with them when they arise in our data science work. Reading these books may give someone the impression that everything in the data science world is smooth and filled with rainbows, something that is (sadly) far from the truth! So, instead of being in denial about this very important matter, we can choose to tackle it calmly and intelligently. This is why I made this video, which is now available on Safari Books Online for everyone interested in having a better and more bug-free data science life. Enjoy!
Why is it important to ask questions in data science? How can you answer these questions? Where do hypotheses fit in? How does all that relate to the know-how you have? So many questions! For some answers to them, feel free to check out my latest video on Safari Books Online. As always, your feedback is welcome...
This post is inspired by Joel Grus’s latest blog post on an interview of his and his showcasing of his TensorFlow know-how during it. Now, this interview was probably imaginary since I doubt anyone would be that foolish in an interview, but he is not afraid to make fun of himself to get a point across, something that is evident in most of his writings (including his book, Data Science from Scratch). I have no intention to promote him, but I find that his whole approach to data science is very fox-like, so it is only natural that he is mentioned in my blog!
In this dialectic post of his, Joel describes an interviewee’s efforts to come across as knowledgeable and technically adept, as he tries to solve the Fizz Buzz problem he is asked to whiteboard, using this Deep Learning package, along with Python. Of course, why someone would waste time asking him to solve such a simple problem is incomprehensible to me, but perhaps it’s for an entry level position or something. Still, if this were a data science position, the Fizz Buzz problem would be highly inappropriate as it has nothing to do with programming that’s relevant to data science. Joel goes on in his blog post to describe the great lengths he has to go to in order to get a basic neural network trained and deployed so that he can solve the problem. Even though he does nothing wrong (technically), his approach fails to yield the desired output and he fails the interview. That’s not to say that he or his tools are bad, but it clearly illustrates the point he’s trying to make: advanced techniques don’t make one a good data scientist!
This is an issue with many data scientists today who have gotten intoxicated with the latest and greatest A.I. tech that’s found its way into data science. The tech itself is great and the tools it has been implemented with are also great. However, just because you can use them, it doesn’t make you a good data scientist. So what gives? Well, even though Deep Learning is a great framework for tackling tough data science problems, it fails miserably in the simpler ones, which are also quite common. Perhaps it’s the lack of data points, the fact that it takes a while to configure properly, or some other reason that depends on the problem at hand. Whatever the case, as data scientists we ought to be pragmatic and hands-on. Just because we know an advanced Machine Learning technique, it doesn’t mean that we should use it to solve all of the problems we are asked to solve. Sometimes we just need to come up with some simple heuristic and work with that.
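For contrast with the neural network route, here is a minimal sketch of the plain approach to Fizz Buzz, assuming the usual formulation of the problem (multiples of 3 become "Fizz", multiples of 5 become "Buzz", multiples of both become "FizzBuzz"). The function name fizz_buzz is my own choice for illustration and is not taken from Joel's post.

```python
def fizz_buzz(n: int) -> str:
    """Return the Fizz Buzz word for n, or n itself as a string."""
    if n % 15 == 0:   # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Show the first fifteen entries on one line.
print(" ".join(fizz_buzz(i) for i in range(1, 16)))
# prints: 1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz
```

There is no training data, no configuration and no chance of a wrong prediction here, which is the kind of simple solution that works that I am arguing for.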
There is an old saying that illustrates the issue Joel describes in that post: killing a mosquito with a cannon. Yes, you may actually succeed in killing the poor insect with your fancy artillery weapon, but is that really cost-effective? Nowadays many data scientists go with the Deep Learning option because someone convinced them that it’s the best option out there in general, without sitting down for a minute and thinking about whether it’s the best option for the particular problem they are facing. Data science is not as simple and straightforward an approach to problem-solving as some people make it out to be. So let’s get real for a minute and tackle problems like engineers, opting for a simple solution that works, before calling in the A.I. cavalry to help us. Being super adept may be appealing, but we first need to be adept at what we do by employing a down-to-earth approach that just works, before opting for improvements through more advanced models.
Intuition is probably the most undervalued quality in data science, even though it has played a prominent role in Science, throughout the years. Even in mathematics, intuition is very important, since it illuminates avenues of research or novel ways of tackling a particular problem. However, even though intuition is of high regard in most scientific fields today, in data science it is not valued much, especially lately, when the emphasis is on the engineering and modeling aspects of the field.
Data science involves a lot of nitty-gritty work, which is why Kaggle competitions are a bit misleading when it comes to introducing the field to newcomers. Despite their practical value, they emphasize one particular aspect of data science (the most interesting one), creating the belief that it’s all about clever feature engineering and models. So, when someone goes deeper into the field they tend to shift to the other extreme and focus on the data engineering aspects of it, which constitute well over 80% of the actual work a data scientist does. Preparing the data, formatting it, playing around with the variables and turning them into features are all parts of the data engineering stage of the pipeline, and they require more grit than intuition or even intelligence. That’s all fine, but many people forget to get back to the bigger picture afterwards: what data science is all about. If you are thinking “insights” at this point, you are on the right track. However, to bridge the data to these insights, we need some intuition.
We need intuition to figure out the most information-rich features and build them. Without intuition, we wouldn’t be able to figure out what models would be best to try out (contrary to what many people think, there are A LOT of models we can use, not just the more popular ones that appear in textbooks and data science MOOCs). Also, if we are to employ deep learning, which is a great way to tackle the most challenging problems out there, especially if we have a truckload of data at our disposal, then we need intuition there too, in order to figure out what architecture to employ and how to best leverage the meta-features that these deep ANNs will construct after they are trained. Things are not plug-and-play as some people tend to evangelize, especially when it comes to these modern tools. The need for some broader perspective and strategic thinking, both of which stem from intuition, is evident in all data science projects.
How do we develop intuition? Well, that’s the million dollar question. In my experience, it stems from intelligence as well as lateral thinking. When we think about things more openly, much like an artist does, we tend to make more use of the part of our mind that is related to intuition. If you manage to come up with a fairly original way of dealing with the data (even if someone else somewhere has come up with it too), if you figure out some clever heuristic that will cut down the computational cost of your process, and if you build a novel ensemble to harness the signals from various models, then you are using your intuition constructively and you are thinking like a data science creative.
Intuition is closely related to creativity, which is why it is often the case that people who build data science teams look out for this characteristic in their recruits, especially if it is for a more senior position. However, for some reason they don’t use that word much (intuition) since it has some undesirable connotations. Oftentimes, intuition is considered to be in the domain of pseudo-science, since its fruits fail to be understood by the more down-to-earth practitioners of data science. Nevertheless, intuition has been used successfully by many inventors, debunking the claim that it is the domain of crackpots. The problem is that it is very hard for most people to assess intuition in an individual, which is why it is often neglected in more hands-on fields. However, if you have used your intuition in a project and have come up with a creative approach to it, that is not only original but also apparent to someone who views your work, then that’s a sign that these people cannot ignore.
So, even if intuition is not so fashionable today, when fancy A.I. tech is all the rage, it still has a place in data science. Just like the fabled warrior-magicians in the Star Wars saga manage to combine the mastery of hands-on techniques with an intuitive approach to life (through the Force), so can we, as data scientists, employ both technical skill and intuition, to tackle the challenges of big data problems and derive actionable insight from the chaotic data we are given.
We hear a lot about deep learning (DL) lately, mainly through the social media. All kinds of professionals, especially those involved in data science, never get tired of praising it, with claims ranging from “it’s greatly enhancing the way we perform predictive analytics” to “it’s the next best thing since sliced bread or baked bread for that matter!” What few people tell us is that most of these guys (they are mainly male) have vested interests in DL, so we may want to take these claims with a pinch of salt!
Don’t get me wrong though; I do value DL and other A.I. methods for machine learning (ML). However, we need to be able to distinguish between the marketing spiel and the facts. The former is for people poised to promote DL at all costs (for their own interests), while the latter is for engineers and other down-to-earth people who prefer to form their own opinions on the matter, rather than get all infatuated with this tech like some mindless technically inept fanboy.
Deep Learning involves the training and application of large ANNs to predictive analytics problems. It requires a lot of data and it promises to provide a more robust generalization based on that data, definitely better than the already obsolete statistical models, whose performance in most big data problems leaves a lot to be desired. Still, it is not clear whether DL can tackle all kinds of problems. For example, it is quite challenging to acquire the amount of data that is needed in order to solve fraud detection or other anomaly detection problems. When it comes to classifying images, however, the data available is more than adequate to train a DL network and let it do its magic. In addition, if we are interested in finding out why data point X is predicted to be of value Y (i.e. which features of X contribute the most to this prediction), we may find that DL isn’t that helpful because of the black box problem that it inherently has, just like all other ANN-based models. If, however, all we care about is getting this prediction and getting it fast, a DL network is sufficient, especially if we train it offline before we deploy it on the cloud (or on a physical computer cluster, if you are more old-fashioned).
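To make the "which features contribute the most" question concrete, here is a minimal sketch using a transparent model for contrast. Note that this is a plain linear model, not a DL network, and that the feature names, weights and data point are all hypothetical values made up for illustration. In such a model each feature's contribution is simply its weight times its value, so the explanation can be read off directly, which is exactly what we cannot do with the weights of a deep ANN.

```python
def linear_contributions(weights, bias, x):
    """Split a linear model's prediction into per-feature contributions."""
    contributions = {name: weights[name] * x[name] for name in weights}
    prediction = bias + sum(contributions.values())
    return prediction, contributions


# Hypothetical model and data point, made up for illustration only.
weights = {"age": 0.5, "income": -0.2, "visits": 1.5}
bias = 1.0
x = {"age": 4.0, "income": 10.0, "visits": 2.0}

prediction, contributions = linear_contributions(weights, bias, x)

# Rank the features by the absolute size of their contribution.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
print(f"prediction: {prediction:.2f}")
```

In this made-up case the "visits" feature pushes the prediction up the most, while "age" and "income" cancel each other out.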
Deep Learning can be of benefit to data science as it is a powerful tool. However, it’s not the tool that is going to make all other tools obsolete. As long as there are other parts in the pipeline beyond the data engineering and data modeling ones (e.g. data visualization, communicating the results, understanding the business questions, formulating hypotheses, among others), getting a DL system to replace data scientists is a viable option only in sci-fi movies. People who fantasize about the potential of DL in data science, imagining it to be the panacea that will enable companies to replace data scientists, probably don’t understand how data science works and/or how the business world works. For example, someone has to be held accountable for the predictions involved and that person will have to explain them, in comprehensible terms, to both her manager and the other stakeholders of the data science project. Clearly, no matter how sophisticated DL systems are, they are unable to undertake these tasks. As for hiring some technically brilliant idiot to operate these systems and be a make-believe data scientist, with the salary of an average IT professional, well that’s definitely an option, but not one that any sane person would be likely to recommend to an organization, given that she wants to keep that organization as a client. If such a decision is to be made, it is most likely going to come from some person who cares more about pleasing his supervisor by telling her what she wants to hear, than about saying something that is bound to stand the test of time.
All in all, DL is a great tool, but we need to be realistic about its benefits. Just like any other innovative technology, it has a lot of potential, but it’s not going to solve all our problems and it’s definitely not going to replace data scientists in the foreseeable future. It can make existing data scientists more productive though, especially if they are familiar with A.I. and have some experience with using ANNs in predictive analytics. If we keep all that in mind and manage our expectations accordingly, we are bound to benefit from this promising technology and use it in tandem with other ML methods, making data science not only more efficient but also richer and even more interesting than it already is.
Contrary to what many people will have you believe, imagination is not just for researchers and PMs, when it comes to data science. Every single aspect of a scientific project has some imagination in it, even the most mundane and straight-forward parts. As data science involves a lot of creativity, at least for the time being, it is not far-fetched to presume that imagination has an important role to play in the field.
By the term imagination I mean the conscious use of the mind for projecting new forms or perceiving forms that could be, but are not manifested. It is very different from the unconscious use of the mind, which is what psychologists refer to as fantasy, a fairly futile endeavor that frequents the undisciplined and immature minds. As data scientists we often need to see what is not there and create it if it’s useful, or find some way to deal with it if it’s a potential issue.
Of course there are those hard-core Deep Learning (DL) people out there who believe that with a good enough DL network you don’t really need to worry about any of this. They advocate the idea that A.I. can take care of all this through the systematic and/or stochastic handling of all the possibilities in a feature set, yielding the optimum collection of features that it will then use for the task at hand. Although there is no doubt that a good enough ANN can do all that, it still doesn’t solve all the potential issues, nor does it make the role of a human being unnecessary. A good motorbike can alleviate a lot of the hard work required for getting from A to B, but it still doesn’t eliminate the need for someone who steers the vehicle and keeps it safely on the road.
Imagination is our navigator in many projects and although it often lends itself to feature engineering and other data engineering tasks, it is also useful for something else that no A.I. has managed to achieve yet: the development of hypotheses and a plan of action based on the data at hand. Data science is not all about getting a model working and coming up with some good score in a performance metric. This is just one aspect of it, the one that Kaggle focuses on, in order to make its competitions more appealing. However, a large part of the data science work involves exploring the data and figuring out what kind of insights it can yield. A robust A.I. system can be an invaluable aid in all that, but we cannot outsource this task to it, no matter how many GPUs we use or how slick the training algorithms the system employs. Just like an organization cannot function properly if its members are all complete imbeciles (the components an automated DS system consists of), a DS project needs some higher intelligence too (the equivalent of a competent manager in the aforementioned organization).
We need to set goals in our project and foresee potential problems and opportunities, before we come to that part of the pipeline, otherwise we risk having to go back and forth, wasting valuable resources. So, even though focused and meticulous work is essential, being able to step back and see the bigger picture is equally important. That’s why oftentimes a data science endeavor is handled by a team of professionals, with the data science lead undertaking that role. So, if you want to make things happen in data science, something that an A.I. is unable to undertake, you need to use imagination. The latter, along with the systematic aspects of the role, can lead you to the desired outcome of your data science project, be it insights or a data product. Imagine that!
The other day I was talking with an acquaintance of mine who is the CEO of a local startup in London and I was astonished to discover that the faux data science trend that plagues the West Coast of the US is in London too. The British capital is not only the home of the top A.I. startup, DeepMind, which was acquired by Google recently, but it’s also the place where a great many data scientists have come about. Also, it prides itself on its pragmatism and on how grounded it is, especially when it comes to science. Still, somehow many of the data science practitioners in this great city are what I call faux data scientists, professionals who use the term “data scientist” on their business card, even though they have no real relation to the field.
Contrary to a real data scientist, the profile of whom I describe in my first book, a faux one is both confusing and confused. A (true) data scientist focuses on predictive analytics, usually through the use of ML systems and lately systems powered by A.I., even though he also makes use of Statistics in various ways. A faux data scientist, on the other hand, relies mainly on Stats and some of the most rudimentary ML models (though he may use ANNs too, without bothering to configure them properly or even read up on the corresponding scientific literature). While a data scientist relies on science to obtain insights and makes use of various methods for communicating her results, a faux one creates pretty plots that may or may not convey any real findings, though they may impress his audience. A faux data scientist usually outperforms the real data scientist in another thing: BS talk. The real data scientist tends to be more humble and veers away from extravagant claims about what the data can yield. This is particularly true if he comes from an academic background. However, the faux data scientist has no inhibitions when it comes to making excessive promises and delivering insights that would qualify for a Nobel prize, if they held water. In other words, the faux data scientist is full of hot air, but manages to hide all the BS of his methodology behind fancy talk brimming with buzzwords and anything else he could come up with in order to convince (or please, rather) his audience.
Unfortunately the damage that a faux data scientist does goes beyond his personal work. Given enough time, the managers of the corresponding projects will see through all the BS this “data scientist” does. That’s the time when the faux data scientist will probably leave or go off to start his own company. However, the loss of confidence in the profession is bound to linger. And even though it took years of hard work and equally hard research in the field to build this confidence, it’s not as strong as it needs to be in order to sustain this kind of damage. Of course the faux data scientist doesn’t care because he’s in it for the money, the reputation, or whatever other personal gain his ambitions dictate. He hasn’t done any research on the science behind the techniques and is adept only at applying other people’s work, through the myriad of Python and R packages that are out there. But it’s not all bad. As he is bound to talk his way into all sorts of situations, once the field no longer serves his purposes, he is bound to jump ship to some other field (whatever is trendy at that time) and never look back.
However, the faux data scientist is not to blame entirely for all this. She is just taking advantage of the situation, particularly the fact that the hiring managers look for 1) x years of experience in the field, experience they are unable to assess accurately, and 2) someone with “excellent communication skills”, especially when it comes to showcasing projects brimming with eye candy and buzzwords the management will recognize. So, unless we start seeing through the BS of the faux data scientists and treat them the way they deserve, this situation is not going to go away any time soon…