Lately everyone likes to talk big picture when it comes to data science and artificial intelligence. I'm guilty of this too, since this kind of talk lends itself to blogging. However, it is easy to get carried away and forget that data science is a very detailed process that requires meticulous work. After all, no matter how much automation takes place, mistakes are always possible and oftentimes unavoidable. Even though programming bugs are, to some extent, easier to identify and even prevent, some problems may still arise, and it is the data scientist's obligation to handle them effectively.

I'll give an example from a recent project of mine, a PoC in the text analytics field. The idea was to develop a set of features from various texts and then use them to build an unsupervised learning model. Everything in the design and the core functions was smooth, even from the first draft of the code. Yet, when running one of the scripts, the computer kept running out of memory. That's a big issue, considering that the text corpus was not huge and the machine used to run the programs is a pretty robust system with 16GB of RAM, running Linux (so a solid 15GB of RAM is available for the programming language to utilize as needed). Still, the script would cause the system to slow down until it eventually froze (no swap partition was set up when I installed the OS, since I didn't expect to ever run out of memory on this machine!).

Of course, the problem could be resolved by adding a swap option to the OS, but that still would not be a satisfactory solution, at least not for someone who opts for writing efficient code. After all, when building a system, it is usually built to scale well, and this prototype of mine didn't look very scalable. So, I examined the code carefully and came up with various hacks to manage resources better. In particular, I got rid of an unnecessary array that was eating up a lot of memory and rerouted the information flow so that existing arrays could provide the same result (a rough sketch of this kind of refactoring appears at the end of this post). After a couple of attempts, the system was running smoothly and without using too much RAM.

It's small details like these that make the difference between a data science system that is practical and one that is good only on the conceptual level (or one that requires a large cluster to run properly). Unfortunately, that's something that is hard to learn through books, videos, or other educational material. Perhaps even conventional experience may not trigger this kind of lesson, though a good mentor might be very beneficial in such cases.

The moral of the story, for me, is that we ought to continuously challenge ourselves in data science and never be content with our aptitude level. Just because something runs without errors identifiable by the compiler doesn't mean that it's production-ready. Even in the case of a simple PoC, like this one, we cannot afford to lose focus. Just as data is constantly evolving into more and more refined information, data scientists follow a similar process, as we grow into more refined manifestations of the craft.
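Since the project's scripts aren't shown here, the following is a minimal sketch of the kind of refactoring described above, written in Python. All names in it (load_documents, extract_features, build_feature_matrix) are hypothetical stand-ins, not the actual code; the point is simply the shift from materializing the whole corpus in memory to streaming it one document at a time.

```python
# Hypothetical illustration; these function names are made up,
# as the original post does not include any code.

def load_documents(paths):
    """Yield one document's text at a time instead of reading
    the whole corpus into a list up front."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield f.read()

def extract_features(text):
    """Stand-in for a per-document feature computation; it returns
    a small, fixed-size record rather than keeping the text around."""
    tokens = text.split()
    return {"n_tokens": len(tokens), "n_unique": len(set(tokens))}

def build_feature_matrix(paths):
    # Memory-hungry version (the kind of code that can exhaust RAM):
    #   texts = [open(p, encoding="utf-8").read() for p in paths]
    #   return [extract_features(t) for t in texts]
    # Streaming version: only one document is in memory at a time,
    # and only the compact feature records accumulate.
    return [extract_features(text) for text in load_documents(paths)]
```

The streaming version touches each document once and keeps only the compact feature records in memory, which is often the difference between a script that freezes a 16GB machine and one that runs comfortably on it.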
1 Comment
9/5/2017 03:30:51 pm
One of the things they teach you in computer science (full disclosure: I am not a computer scientist) is picking an algorithm that trades space for time. The most recent example I encountered was parsing a large corpus of text in XML format. There is a library, lxml, that will handle XML as a stream, but I couldn't figure out the "now what" after I was done, because XPath queries didn't work. The answer was to use a (small) subset of the corpus as a PoC, from which I concluded (correctly) that this was a blind alley.
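For reference, a minimal sketch of the streaming pattern described in this comment, assuming lxml's etree.iterparse and a hypothetical repeated tag named "record". Since the full tree is never built, XPath over the whole document is indeed unavailable, which matches the "now what" problem:

```python
from lxml import etree

def stream_records(xml_path, tag="record"):
    """Stream matching elements from a large XML file one at a time.
    'record' is a hypothetical tag name for whatever element
    the corpus repeats."""
    for _, elem in etree.iterparse(xml_path, events=("end",), tag=tag):
        yield elem.text
        # Clear the processed element and drop already-handled
        # siblings so memory stays flat regardless of file size.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```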