Just wanted to give you all a quick update. Lately, thanks to my marketing consultant's suggestions, I pursued alternative (more fox-like) ways to contribute to the data science field through my writing. Clearly the tech evangelists and the faux data scientists out there have done so much damage to it through the spreading of unrealistic promises and unnecessary hype about it, so it needs all the help it can get! So, I've decided to join the Data Science Partnership team as Head of Content and publish data science and A.I. related articles there, in an attempt to reach a broader audience.
I will continue posting stuff in Foxy Data Science too, but not as frequently as before since my focus is on quality rather than quantity. Also, being in the process of writing a new book (through Technics Publications) takes up some of my time too, leaving me less time to blog. More on that in a later post...
If you are interested in checking out my new articles at DSP blog, you can find them at the corresponding site. Thank you for taking the time to read my stuff. I look forward to sharing more!
Everyone uses text in digital format, especially since the rise of social media. That’s why text has become one of the most commonly used resources in data science. In his latest book, Turning Text into Gold (Technics Publications), the father of data warehousing takes a stab at this intriguing topic. Here, we’ll take a look at his book from a couple of different angles (rather than just giving an opinionated review like pretty much everyone else does on e-commerce sites).
Before we start, let me say for the record that I don’t know Bill Inmon personally, nor has anyone asked me to review any of his books. I just find the topics he deals with quite interesting and worth learning more about, even if they don’t directly relate to my field of expertise.
In his book, Turning Text into Gold, Bill Inmon examines various topics related to text modeling and NLP. Namely, he looks at taxonomies, ontologies, databases (briefly), text data types, text analytics, two different levels of text processing, and four different use cases of text analytics across the industry. His overall style is high-level, while the book is rich with diagrams to clarify the points he makes. He also has a number of examples in every chapter to clarify these points further. The structural complexity of the text is fairly basic, so everyone can read it, even in a busy coffee shop or while riding the bus. Make no mistake, however: the book is not targeted at novices. In fact, in order to make the most of this resource, you’ll need some basic understanding of text analytics, otherwise it is bound to appear a bit abstract.
Data Modeling Perspective
From a data architect’s viewpoint, this book covers the topic very extensively. The author’s expertise in the field becomes abundantly clear from the get-go, as he explains the key concepts of text-related data structures in such simple terms that only a true master of the field could. Without hiding behind jargon or complex text structures, he presents the main ideas of each topic elegantly and with enough detail to make them comprehensive. It would be great if he would add a few links or references in general for further investigation, however, as some topics are quite deep and may require more research for someone new to this field.
Data Science Perspective
From a data scientist’s perspective, this book is not very relevant, unless you are already an expert in NLP. The author doesn’t provide any guidance about how to implement any of the ideas he exhibits, nor does he hint towards any particular packages / tools for applying the frameworks he describes. So, if you are a data scientist who is new to NLP and text analytics in general, you may find this book a bit too introductory. Nevertheless, if you read it in conjunction with other, more low-level books, you may find it very insightful. Also, if you are already adept in the techniques of NLP, you may find it very useful for understanding where everything fits, in the bigger picture.
Just like the alchemists of our times, who aim to turn low-value data into gold, the reader can make a similar transmutation of the text of this book. However, she may need to combine its contents with know-how from other sources, for a smoother process. Nevertheless, this book is an excellent introductory resource to the field of text analytics, which has a lot to offer to both data modeling and data science alike.
After talking with my publisher, I got him to offer a 25% discount on my latest book to the people I connect with through this blog and through beBee. This discount applies to both the printed version and the PDF. So, if you found the price too steep, here is a promo code you can apply to get 25% off, for a limited time only: Zack25
A public domain photo refactored as a piece of digital art by a deep learning network, on my laptop
Painting is not my favorite art. Nevertheless, I do enjoy it more than most people (apart from people who actually practice the art perhaps), since it’s easy on the eyes and the meaning it tries to convey is far easier to grasp than any other art. Creating something in this art form is very time-consuming though, which is why I admire those who have the patience to make something beautiful out of their canvases and their paints. Also, it takes a special kind of intelligence to be able to create in this domain. Could it be that artificial intelligence can emulate that? The answer is yes!
Over the years, machines have been used in a variety of creative tasks, particularly music. This is obvious to those who have delved into this art, but I don’t want to go off on a tangent here. Doing something creative with A.I. in the painting domain is a whole different kind of challenge though, especially if you don’t know much about the art, like most A.I. people. Of course everyone can do some rudimentary kind of drawing, but does that qualify as art? I doubt it, and I’m sure anyone who has delved into the fascinating history of art would agree. There is something else when it comes to making a painting, something that has eluded A.I. algorithms… up until now.
So, what is A.I.-based painting? Well, it is digital for starters. It’s not like the A.I. picks up a palette and a brush and starts coloring a canvas (although I wouldn’t be surprised if there were robots out there equipped with such an A.I. doing just that). Most A.I. systems that can paint do so with a deep learning system that has been trained in a particular style of painting. As an input, such an A.I. system usually takes a digital image, which is the equivalent of the idea or subject that usually fuels such creative endeavors in human beings. What the A.I. does after that is create a new image that makes use of the primary features of the original image (the quintessence of the subject, if you will). These features, which correspond to a particular color palette, shapes, locations of these shapes, and other relevant information, are then processed by the deep learning network they employ. The output of that network is then mapped into a form akin to the original image or corresponding to a set of specifications regarding the resolution of the art piece. The output, naturally, is an image of that resolution. Of course, the A.I. doesn’t have a clue of what it is doing, but given enough training data in its deep learning network, it can perform the task quite creatively.
Although most such synthetic artistic products are very interesting, not all of them are particularly pleasing or even worth the wait of the whole creation process (which is non-trivial when undertaken by a single computer, even if the deep learning networks are already trained). So, if you are an artist committed to this particular art form, you shouldn’t worry about your work being outsourced to machines any time soon! Whatever the case, applications like this are far more meaningful than other, less thoughtful uses of computational resources for A.I. purposes. This, however, is probably the topic of a future post on the subject…
Julia has been a topic of controversy in the previous year, the year that was critical for the language’s future, at least in the data science domain. In the beginning of that year, while working at a small-medium company as a data science contractor, I remember making the argument that Julia is ready for data science and that we should give it a shot. Both the people of that company and the people of a vendor company (a local data science start-up that was acquired by Apple later that year) were very skeptical about this. Claims that “Julia is not data science ready” which floated all over the web seemed to echo in our conversations as well.
Later that year I focused on my book on the language and its applications in data science, a book I had started writing the previous Fall. At that point no-one else seemed to care about Julia in the data science community, and the big players in the corporate world that had a say about data science (e.g. Amazon, Microsoft, etc.) didn’t even seem to take notice of this promising technology. Still, I knew that the merits of this language would one day surface in people’s minds as well as on the web. So, I finished the book, got it published, and gave a couple of talks on the language. Even though it was the first book ever to be written on this topic (focusing on the data science applications of Julia), it was soon followed by another one from another publisher, bearing the same title! Also, a few days before I gave my first talks on the subject, Julia entered the top 50 languages in the TIOBE index for that month (blog article from Julia Computing). Clearly the claim that Julia was not data science ready had started to seem like an opinion of the less informed people.
It was that Fall, about a year after I’d started working feverishly on my Julia book, that Amazon took a very bold step, which I consider to be the tipping point. That Fall, Julia started to rise in the eyes of the corporate world, as Amazon adopted the MXNet deep learning framework, which included Julia as one of the languages that it supported (MXNet article on my blog). The researchers involved in this project even published a scientific article about this, in collaboration with the University of Washington, a very prestigious academic institution that was one of the first ones to popularize data science education through its corresponding programs.
After that point, Julia was officially a fully cloud-supported technology. Microsoft soon joined the game by adopting it in the Azure framework (blog article by a Julia user in Denmark). Even Google decided to support Julia in its TensorFlow deep learning system, which up until then was Python exclusive. It seems that the use of Julia in data science is not a fad after all!
Yet, there are still people claiming that Julia is not a data science language and that language X is the way to go because most people have been using X in the past few years. Perhaps they are right, at least subjectively. Some companies are so conservative that they will probably die before admitting that the technology they are using is not the best out there. However, instead of paying attention to them, you can do your own research on the topic and form your own view on the matter. That’s what I did and I never regretted it!
Bugs are terrible and high-level mistakes are even worse! Yet, most data science books out there don't say much about them, or how we can deal with them when they arise in our data science work. Reading these books may give someone the impression that everything in the data science world is smooth and filled with rainbows, something that is (sadly) far from the truth! So, instead of being in denial about this very important matter, we can choose to tackle it calmly and intelligently. This is why I made this video, which is now available on Safari Books Online for everyone interested in having a better and more bug-free data science life. Enjoy!
Why is it important to ask questions in data science? How can you answer these questions? Where do hypotheses fit in? How does all that relate to the know-how you have? So many questions! For some answers to them, feel free to check out my latest video on Safari Books Online. As always, your feedback is welcome...
This post is inspired by Joel Grus’s latest blog post, which describes an interview of his and how he showcased his TensorFlow know-how during it. Now, this interview was probably imaginary since I doubt anyone would be that foolish in an interview, but he is not afraid to make fun of himself to get a point across, something that is evident in most of his writings (including his book, Data Science from Scratch). I have no intention of promoting him, but I find that his whole approach to data science is very fox-like, so it is only natural that he is mentioned in my blog!
In this dialectic post of his, Joel describes an interviewee’s efforts to come across as knowledgeable and technically adept, as he tries to solve the Fizz Buzz problem he is asked to whiteboard, using this Deep Learning package, along with Python. Of course, why someone would waste time asking him to solve such a simple problem is incomprehensible to me, but perhaps it’s for an entry level position or something. Still, if this were a data science position, the Fizz Buzz problem would be highly inappropriate as it has nothing to do with programming that’s relevant to data science. Joel goes on in his blog post to describe the great lengths he has to go to in order to get a basic neural network trained and deployed so that he can solve the problem. Even though he does nothing wrong (technically), his approach fails to yield the desired output and he fails the interview. That’s not to say that he or his tools are bad, but it clearly illustrates the point he’s trying to make: advanced techniques don’t make one a good data scientist!
This is an issue with many data scientists today who have gotten intoxicated with the latest and greatest A.I. tech that’s found its way into data science. The tech itself is great and the tools it has been implemented with are also great. However, just because you can use them, it doesn’t make you a good data scientist. So what gives? Well, even though Deep Learning is a great framework for tackling tough data science problems, it fails miserably in the simpler ones, which are also quite common. Perhaps it’s the lack of data points, the fact that it takes a while to configure properly, or some other reason that depends on the problem at hand. Whatever the case, as data scientists we ought to be pragmatic and hands-on. Just because we know an advanced Machine Learning technique, it doesn’t mean that we should use it to solve all of the problems we are asked to solve. Sometimes we just need to come up with some simple heuristic and work with that.
There is an old saying that illustrates the issue that Joel describes in that post: killing a mosquito with a cannon. Yes, you may actually succeed in killing the poor insect with your fancy artillery weapon, but is that really cost-effective? Nowadays many data scientists go with the Deep Learning option because someone convinced them that it’s the best option out there in general, without sitting down for a minute and thinking about whether it’s the best option for the particular problem they are facing. Data science is not as simple and straightforward an approach to problem-solving as some people make it out to be. So let’s get real for a minute and tackle problems like engineers, opting for a simple solution that works, before calling in the A.I. cavalry to help us. Being super adept may be appealing, but we first need to be adept at what we do by employing a down-to-earth approach that just works, before opting for improvements through more advanced models.
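To make the "simple solution that works" point concrete, here is a minimal sketch of Fizz Buzz in plain Python, assuming the usual statement of the problem (multiples of 3 become "Fizz", multiples of 5 become "Buzz", multiples of both become "FizzBuzz"). The function name fizz_buzz is my own choice for illustration and is not taken from Joel's post.

```python
def fizz_buzz(n: int) -> str:
    """Return the Fizz Buzz word for n, or n itself as a string."""
    if n % 15 == 0:  # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    # Print the first fifteen values, one per line.
    for i in range(1, 16):
        print(fizz_buzz(i))
```

No training data, no architecture to tune, and it is correct for every integer, which is more than the neural network in the story managed.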
Intuition is probably the most undervalued quality in data science, even though it has played a prominent role in Science throughout the years. Even in mathematics, intuition is very important, since it illuminates avenues of research or novel ways of tackling a particular problem. However, even though intuition is held in high regard in most scientific fields today, in data science it is not valued much, especially lately, when the emphasis is on the engineering and modeling aspects of the field.
Data science involves a lot of nitty-gritty work, which is why Kaggle competitions are a bit misleading when it comes to introducing the field to newcomers. Despite their practical value, they emphasize one particular aspect of data science (the most interesting one), creating the belief that it’s all about clever feature engineering and models. So, when people go deeper into the field they tend to shift to the other extreme and focus on the data engineering aspects of it, which constitute well over 80% of the actual work a data scientist does. Preparing the data, formatting it, playing around with the variables and turning them into features are all parts of the data engineering stage of the pipeline, and they require more grit than intuition or even intelligence. That’s all fine, but many people forget to get back to the bigger picture afterwards: what data science is all about. If you are thinking “insights” at this point, you are on the right track. However, to bridge the data to these insights, we need some intuition.
We need intuition to figure out the most information-rich features and build them. Without intuition, we wouldn’t be able to figure out what models would be best to try out (contrary to what many people think, there are A LOT of models we can use, not just the more popular ones that appear in textbooks and data science MOOCs). Also, if we are to employ deep learning, which is a great way to tackle the most challenging problems out there, especially if we have a truckload of data at our disposal, then we need intuition there too, in order to figure out what architecture to employ and how to best leverage the meta-features that these deep ANNs will construct after they are trained. Things are not plug-and-play as some people tend to evangelize, especially when it comes to these modern tools. The need for some broader perspective and strategic thinking, both of which stem from intuition, is evident in all data science projects.
How do we develop intuition? Well, that’s the million dollar question. In my experience, it stems from intelligence as well as lateral thinking. When we think more openly, much like an artist does, we tend to make more use of the part of our mind that is related to intuition. If you manage to come up with a fairly original way of dealing with the data (even if someone else somewhere has come up with it too), if you figure out some clever heuristic that will cut down the computational cost of your process, and if you build a novel ensemble to harness the signals from various models, then you are using your intuition constructively and you are thinking like a data science creative.
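For readers who haven't built an ensemble before, here is a minimal sketch of the plainest kind, a majority vote over a few models. Everything in it is an assumption made for illustration: the three "models" are hypothetical threshold rules rather than trained classifiers, and the names (majority_vote, rule_a, rule_b, rule_c) are mine. The creative part, deciding which models to combine and how, is exactly where intuition comes in.

```python
from collections import Counter
from typing import Callable, Sequence

# A "model" here is just any function mapping a feature vector to a label.
Model = Callable[[Sequence[float]], str]


def majority_vote(models: Sequence[Model], x: Sequence[float]) -> str:
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy models: each one looks at the data differently.
def rule_a(x: Sequence[float]) -> str:
    return "high" if x[0] > 0.5 else "low"


def rule_b(x: Sequence[float]) -> str:
    return "high" if x[1] > 0.5 else "low"


def rule_c(x: Sequence[float]) -> str:
    return "high" if sum(x) / len(x) > 0.5 else "low"


if __name__ == "__main__":
    ensemble = [rule_a, rule_b, rule_c]
    print(majority_vote(ensemble, [0.9, 0.2, 0.7]))
```

With an odd number of models and two labels there can be no ties; anything beyond that (weighting, stacking, and so on) would need a deliberate tie-breaking and weighting strategy.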
Intuition is closely related to creativity, which is why it is often the case that people who build data science teams look out for this characteristic in their recruits, especially if it is for a more senior position. However, for some reason they don’t use that word much (intuition) since it has some undesirable connotations. Oftentimes, intuition is considered to be in the domain of pseudo-science, since its fruits fail to be understood by the more down-to-earth practitioners of data science. Nevertheless, intuition has been used successfully by many inventors, debunking the claim that it is the domain of crackpots. The problem is that it is very hard for most people to assess intuition in an individual, which is why it is often neglected in more hands-on fields. However, if you have used your intuition in a project and have come up with a creative approach to it, that is not only original but also apparent to someone who views your work, then that’s a sign that these people cannot ignore.
So, even if intuition is not so fashionable today, when fancy A.I. tech is all the rage, it still has a place in data science. Just like the fabled warrior-magicians in the Star Wars saga manage to combine the mastery of hands-on techniques with an intuitive approach to life (through the Force), so can we, as data scientists, combine technical skill with intuition to tackle the challenges of big data problems and derive actionable insight from the chaotic data we are given.
We hear a lot about deep learning (DL) lately, mainly through the social media. All kinds of professionals, especially those involved in data science, never get tired of praising it, with claims ranging from “it’s greatly enhancing the way we perform predictive analytics” to “it’s the next best thing since sliced bread or baked bread for that matter!” What few people tell us is that most of these guys (they are mainly male) have vested interests in DL, so we may want to take these claims with a pinch of salt!
Don’t get me wrong though; I do value DL and other A.I. methods for machine learning (ML). However, we need to be able to distinguish between the marketing spiel and the facts. The former is for people poised to promote DL at all costs (for their own interests), while the latter is for engineers and other down-to-earth people who prefer to form their own opinions on the matter, rather than get all infatuated with this tech like some mindless technically inept fanboy.
Deep Learning involves the training and application of large ANNs to predictive analytics problems. It requires a lot of data and it promises to provide a more robust generalization based on that data, definitely better than the already obsolete statistical models, whose performance in most big data problems leaves a lot to be desired. Still, it is not clear whether DL can tackle all kinds of problems. For example, it is quite challenging to acquire the amount of data that is needed in order to solve fraud detection or other anomaly detection problems. When it comes to classifying images, however, the data available is more than adequate to train a DL network and let it do its magic. In addition, if we are interested in finding out why data point X is predicted to be of value Y (i.e. which features of X contribute the most to this prediction), we may find that DL isn’t that helpful because of the black box problem that it inherently has, just like all other ANN-based models. If, however, all we care about is getting this prediction and getting it fast, a DL network is sufficient, especially if we train it offline before we deploy it on the cloud (or on a physical computer cluster, if you are more old-fashioned).
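To illustrate what "which features contribute the most" looks like when a model is not a black box, here is a rough sketch using a plain linear model as the contrast case (my choice of contrast, not something prescribed above). In a linear model each feature's contribution to a prediction is simply its weight times its value, so it can be read off directly; a deep network offers no such direct readout. The weights, feature names and function names below are all made up for the example.

```python
from typing import Dict


def linear_contributions(weights: Dict[str, float],
                         features: Dict[str, float]) -> Dict[str, float]:
    """Per-feature contribution (weight * value) to a linear prediction."""
    return {name: weights[name] * value for name, value in features.items()}


def top_contributor(weights: Dict[str, float],
                    features: Dict[str, float]) -> str:
    """Name of the feature with the largest absolute contribution."""
    contributions = linear_contributions(weights, features)
    return max(contributions, key=lambda name: abs(contributions[name]))


if __name__ == "__main__":
    # Hypothetical weights and one hypothetical data point X.
    weights = {"amount": 0.8, "hour": -0.1, "distance": 0.3}
    x = {"amount": 2.0, "hour": 3.0, "distance": 1.0}
    print(linear_contributions(weights, x))
    print(top_contributor(weights, x))
```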
Deep Learning can be of benefit to data science as it is a powerful tool. However, it’s not the tool that is going to make all other tools obsolete. As long as there are other parts in the pipeline beyond the data engineering and data modeling ones (e.g. data visualization, communicating the results, understanding the business questions, formulating hypotheses, among others), getting a DL system to replace data scientists is a viable option only in sci-fi movies. People who fantasize about the potential of DL in data science, imagining it to be the panacea that will enable companies to replace data scientists, probably don’t understand how data science works and/or how the business world works. For example, someone has to be held accountable for the predictions involved and that person will have to explain them, in comprehensible terms, to both her manager and the other stakeholders of the data science project. Clearly, no matter how sophisticated DL systems are, they are unable to undertake these tasks. As for hiring some technically brilliant idiot to operate these systems and be a make-believe data scientist, with the salary of an average IT professional, well that’s definitely an option, but not one that any sane person would be likely to recommend to an organization, given that she wants to keep that organization as a client. If such a decision is to be made, it is most likely going to come from some person who cares more about pleasing his supervisor by telling her what she wants to hear than about saying something that is bound to stand the test of time.
All in all, DL is a great tool, but we need to be realistic about its benefits. Just like any other innovative technology, it has a lot of potential, but it’s not going to solve all our problems and it’s definitely not going to replace data scientists in the foreseeable future. It can make existing data scientists more productive though, especially if they are familiar with A.I. and have some experience with using ANNs in predictive analytics. If we keep all that in mind and manage our expectations accordingly, we are bound to benefit from this promising technology and use it in tandem with other ML methods, making data science not only more efficient but also richer and even more interesting than it already is.