Kaggle is an online community (a social medium of sorts) geared towards data science enthusiasts. It aims to link data analytics problems with data analysts who have a flair for machine learning and other data science methodologies. I’ve even mentioned it in my books and videos at times, as a potential resource for gaining some traction with the data modeling aspect of the field and for getting people more interested in the craft. However, some people confuse Kaggle experience with data science experience. Let’s delve more into this matter.
Kaggle problems tend to be geared towards analysts, not data scientists. The whole nature of a competition is one-sided (aiming to optimize a particular evaluation metric), which is not the same as creating a successful data science product. Although the model in the back-end of a data science product has to be reasonably accurate, there are other factors involved that are just as important as (if not more important than) the model’s raw performance. For example, the amount of resources it uses, its interpretability, how easy it is to use, its maintenance costs, and how compatible it is with existing technologies are all relevant factors when building a data science model.
What’s more, the data in Kaggle competitions is very much like baby formula. In the real world, data is far more complex, noisier, and stems from a variety of sources. So, if someone can handle Kaggle data, that’s great, but it’s not the same as handling real-world data, no matter how robust their models are. In fact, many people argue that around 80% of a data scientist’s work, time-wise, is getting the data clean and ready for the data model.
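To make that 80% figure tangible, here is a minimal sketch of the kind of cleaning real-world data typically demands; the records, field names, and rules below are all hypothetical, just an illustration of the sort of grunt work involved:

```python
# Toy example of common cleaning steps: trimming whitespace,
# normalizing missing-value markers, coercing types, and
# removing duplicates. All field names and rules are made up.

raw_records = [
    {"user": " alice ", "age": "34", "city": "NYC"},
    {"user": "bob", "age": "", "city": "n/a"},       # missing values
    {"user": "alice", "age": "34", "city": "NYC"},   # duplicate after trimming
]

MISSING = {"", "n/a", "na", "null", "none"}

def clean(record):
    out = {}
    for key, value in record.items():
        value = value.strip().lower() if isinstance(value, str) else value
        out[key] = None if value in MISSING else value
    # coerce age to an integer where possible
    if out.get("age") is not None:
        out["age"] = int(out["age"])
    return out

cleaned = [clean(r) for r in raw_records]

# deduplicate while preserving order
seen, deduped = set(), []
for r in cleaned:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

print(deduped)  # two unique, cleaned records remain
```

Even this trivial example involves several judgment calls (what counts as missing, which fields to normalize), which is precisely why the cleaning stage eats up so much time.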
Of course, it’s not just Kaggle that projects this false image of data science; there are other, similar platforms. The UCI repository, for example, is comparable, though in all fairness, its maintainers state that their datasets are there for research purposes. Still, these datasets lend themselves to practicing data modeling and trying out heuristics and new data analytics or A.I. algorithms.
So, when it comes to data science experience, it is important to remember that data modeling is just one part of it. It is an important part, but not the whole picture. If you want to gain real-world experience, you are better off getting some data from real-world sources, such as social media feeds or sensor data from IoT devices, rather than Kaggle competitions. The latter are fun, but data science is not all fun and games. Hackathons are great for practicing coding, yet doing programming in the real world requires more than hackathon experience. Why would data science be any different?
Short answer: no. Long answer: although some experience is positively correlated with aptitude in the field, the relationship between the two is neither linear nor straightforward. Let’s delve into this more, examining some of its lesser-known aspects.
If someone is at the beginning of their career as a data scientist, chances are that having some experience is much better than none at all. Experience at this stage involves dealing with practical challenges that are rarely described in data science books or courses, so for the inexperienced data scientist these can be major liabilities. The experienced data scientist has encountered tricky situations where the models she built have failed, so she has a better chance of avoiding similar situations, or at the very least tackling them efficiently when they occur. Do additional years of experience help a data scientist in her career, though? It depends. Unmistakably, the additional experience of working in an organization allows the professional to cultivate her soft skills and work more effectively in a team. Also, her understanding of how a business works becomes more solid and functional. However, data science aptitude does not necessarily grow as the years of experience accumulate. After all, the field changes so rapidly that a few more years of experience may be irrelevant, as the techniques the more experienced data scientist has mastered may not be so useful or necessary any more.
Of course, there are exceptions. If a data scientist is particularly good, due to talent, education, or some combination of the two, then the additional years of experience will translate into more varied expertise and perhaps the ability to lead a team effectively. The thing is, this kind of person is going to be good even with little or no experience, since the innate talent or general aptitude stemming from a good education is there from the get-go. Naturally, these cases are few and may be considered outliers, but they are relevant enough to be valuable, as the exception that proves the rule.
So, what would be a good proxy for data science aptitude then, if experience is not a good enough feature to predict this valuable variable? Well, it depends on the situation. If you have an organization that deals with text a lot and requires a data scientist to be part of NLP and NLU projects, then some understanding of the language(s) and/or the ability to create and implement scalable heuristics based on text data would be very valuable. These skills would be a better proxy than having spent a number of years in the field focusing mainly on, say, recommender systems. If an organization wants someone to work on image data and solve challenging problems related to it (e.g., object identification), then a solid understanding of image data or of deep learning techniques would be a pretty good proxy of aptitude for this task.
Work experience has remained relevant because of its applicability in various professions. However, inferring that just because it works well for them it must also work in data science is unscientific and reckless, at best. So perhaps organizations that value experience so much are better off being avoided, since it’s doubtful that they have a solid understanding of data science, or the ability to manage this kind of resource effectively (perhaps their managers need to gain some more experience in handling certain human resources, who knows?). After all, just because most organizations can benefit from data scientists, it doesn’t mean that they are data science ready.
“Wait a minute! Isn’t data science all about cool machine learning models, number-crunching, artificial intelligence methods, and big data?” I can hear some people saying. Well, it is all that, but the one thing that binds all these different aspects of data science together is domain knowledge, or in other words, context. You may be adept at cleaning, structuring, and modeling the data at hand, but if you are missing the bigger picture and how all this data (and its distillation) relates to the stakeholders of the project, then you are just an analyst! Data science is not divorced from the real world, even if, in its most esoteric aspects, it may seem quite alienating to the average Joe. Data science is a business framework, among other things, and as such it constitutes an integral part of business processes. Without the latter to provide a sense of perspective and some sort of objective for the data at hand, data science is reduced to an intellectual endeavor, like modern philosophy. There is value in the latter too, but it’s not what data science is about.
Context in data science manifests on various levels. At the larger scale, it’s about relevance to the end-user and the stakeholders of the project. As George Box would put it, no matter how brilliant a data model is, it is wrong, since it is merely an abstraction of reality; yet, if the model is crafted in a way that provides value to the end-user, it can be useful. This value stems from the context it takes into account. Context also manifests in the way the data is engineered and distilled into information. For example, there are a number of ways to do dimensionality reduction (i.e., make the number of features smaller, while in some cases making these features more compact). If you follow a recipe book blindly, you’ll probably resort to PCA, ICA, or some other off-the-shelf method. However, if you look at the problem more closely, you may employ a different strategy, particularly if you have labeled data at your disposal. Such additional information may change the way the feature data is perceived and make a feature filtering approach more relevant, for example.
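To illustrate the point about labeled data, here is a minimal sketch of a feature filtering approach: instead of an unsupervised method like PCA, each feature is scored by how well it separates the two classes, and only the top scorer is kept. The dataset and the scoring rule are made up for the example; in practice you might use a t-statistic, mutual information, or a similar criterion:

```python
# Toy label-aware feature filtering: score each feature by the
# absolute difference of its class means, then keep the best one.

def mean(xs):
    return sum(xs) / len(xs)

def separation_score(feature_values, labels):
    """Absolute difference of class means: a crude stand-in for
    fancier criteria such as a t-statistic or mutual information."""
    pos = [v for v, y in zip(feature_values, labels) if y == 1]
    neg = [v for v, y in zip(feature_values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

# rows = samples, columns = features; labels are binary
X = [
    [1.0, 5.0, 0.2],
    [1.1, 2.0, 0.1],
    [0.9, 5.1, 0.9],
    [1.0, 1.9, 0.8],
]
y = [1, 0, 1, 0]

n_features = len(X[0])
columns = [[row[j] for row in X] for j in range(n_features)]
scores = [separation_score(col, y) for col in columns]

# keep the single most discriminative feature
best = max(range(n_features), key=lambda j: scores[j])
print(best, scores)  # the middle feature separates the classes best
```

Notice that an unsupervised method would have no way of knowing which feature matters for the labels; it is precisely the extra context (the labels) that makes this simple filter sensible here.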
Perhaps it would be prudent to put data science into perspective, rather than focus only on its techniques and tools. Being mindful of the context of every part of the data science pipeline is a great way to accomplish that. After all, just like every applied science, data science is geared towards people, not abstract entities that populate theories and research articles. The latter are useful, but the former are what provide our craft with meaning and business value.
A.I. is great, especially when applied to data science. Many people lately are quite concerned about the various dangers it may entail. This naturally polarizes people, splitting views on the topic into two main camps: those dismissing these concerns and those convinced that the end of the world is upon us. The truth probably lies somewhere in between, but given the lack of evidence, any speculation on the matter may be premature and likely inaccurate.
In this post I’d like to focus on another danger that many people don’t think much about, or don’t see as a danger at all: the sense of complacency that may arise from a super-automated world. Of course, complacency is a human condition and has little to do with A.I. itself, but someone may argue that A.I. is to blame for this condition. After all, super-automation may only become possible through this new technology growing widespread.
This danger, which can find its way to data science too if left unchecked, is a real one. However, it is neither singular nor catastrophic. After all, every large-scale technological innovation has brought about social changes that have triggered this condition to some extent. This does not mean that we should go back to the stone age, however. After all, technology is largely neutral and the people who make it available to the world have the best intentions in mind. So, it seems that blaming a new tech for this matter may be a bit irresponsible.
Yet, the advent of technology can be a good thing if dealt with in a mature manner. Just like you can own a car and still make time for physical exercise, you can have access to an A.I. and still be a creative and productive person. It’s all a matter of power, at the end of the day. If we give away our power, our ability to choose and to shape our lives, then we are left powerless victims of whoever has taken hold of that power. In the case of A.I., if we cherish automation so much that we outsource every task to it, then we are willingly creating our own peril. So, if we choose to maintain a presence in all processes where A.I. is involved, the latter is not going to be a threat, not a considerable one anyway.
There is no doubt that A.I. can be dangerous, much like every other technological advancement. However, it seems that the crux of the problem lies within us, rather than in the machines that embody this technology. If we give in to a sense of complacency and allow the AIs to take a gradually more active part in our society, then maybe this tech will create more problems than the ones it’ll solve. However, if we deal with this new technological advent maturely, we can still benefit from it, without making ourselves obsolete or irrelevant in the process.
When people think about the benefits of A.I. and its impact in our world, they usually think of self-driving cars, advanced automations, deep learning systems, clever chatbots, etc. Those particularly infatuated with the idea of A.I. tend to go even further and fantasize about super-intelligent machines that will magically solve all our problems without any effort from us (pretty much like a deus ex machina figure in some ancient theater play). However, the more pragmatic A.I. thinkers focus more on particular applications of A.I. that can be implemented fairly easily, and that target specific issues that would be impractical to solve in conventional ways. One such case is that of detecting how contaminated beehives are by a particular parasite.
Why should we care about this matter? Don’t we have larger problems to deal with? Perhaps. After all, there are more evident problems out there that require unconventional ways of tackling them, problems that could benefit a lot from a narrow A.I. designed for them. However, the issue of infested beehives is not a minor one, as it represents a real danger for the whole species of these buzzing insects. It’s worth noting that bees are not useful just for the honey they produce; they are key in plant pollination, and as such they play an important role in our planet’s fragile ecosystem, which has been on the wane lately. So, it may be a big deal after all.
Developing an A.I. to tackle the beehive infestation problem is a project whose cost is small relative to its impact, as it is fairly manageable with existing technology, at least for one particular parasite, the Varroa mite. These organisms can cause serious issues for the bees, issues that are observable with the naked eye. However, assessing the infestation may not be so straightforward, making it difficult to take intelligent action against it (e.g., how can you tell which beehives are in imminent danger and prioritize accordingly?). That’s where Computer Vision comes in handy: an automated way for a computer system to evaluate what a camera attached to it observes. The images from the camera feed, when coupled with a deep learning network, can help measure the magnitude of the issue in a very small amount of time (check out a demo of an app by TopLab that does just that). Will this be enough? Possibly, if this process is coupled with an effort to eliminate the parasites once identified. However, knowing about the infestation issue in an objective and practical manner can definitely speed things up.
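To make the idea concrete, here is a toy sketch of the general approach (emphatically not TopLab’s actual method): real systems would run camera frames through a trained deep network, but even a crude, hand-made color rule conveys how pixels can become an infestation estimate. Every threshold and the tiny “image” below are made up:

```python
# Toy stand-in for the computer-vision idea: given a tiny, fake
# RGB pixel grid, estimate how much of it is covered by dark
# reddish spots, a crude proxy for Varroa mites on a bee.

def looks_like_mite(pixel):
    r, g, b = pixel
    # Varroa mites appear as small dark reddish-brown spots;
    # these thresholds are purely illustrative.
    return r > 80 and g < 60 and b < 60 and r > g + 30

def infestation_ratio(image):
    flagged = sum(looks_like_mite(p) for row in image for p in row)
    total = sum(len(row) for row in image)
    return flagged / total

# a fake 2x3 "image": two mite-colored pixels out of six
image = [
    [(230, 180, 90), (100, 40, 30), (225, 175, 95)],
    [(235, 185, 100), (110, 45, 35), (220, 170, 85)],
]

ratio = infestation_ratio(image)
print(f"{ratio:.0%} of pixels flagged")
```

A deep network replaces the hand-made rule with learned features, but the output serves the same purpose the post describes: an objective number per hive that lets a beekeeper prioritize which colonies to treat first.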
Perhaps A.I. is not as futuristic as it is often perceived, nor as high-level as it comes across. After all, just like any other applied science, it aims to solve real-world problems right here and now, in an efficient and effective manner. The question is, are we willing to apply it to more strategic problems, like the case of an impaired ecosystem, or are we going to use it only to make our urban lives more convenient? Hopefully that’s a question we can answer with just our natural intelligence...
Sentiment Analysis is a popular NLP topic that I've been involved in for a while now. I even wrote an article about it for a friend of mine, who is an editor at a marketing blog. Anyway, after I finally finished my latest book (Technics Publications, ETA: Fall 2017), I had some time to work on a video for Safari Books Online. This video is now online at Safari and will probably be followed by similar ones on NLP and NLU related topics. Any suggestions are welcome!
When MAXset made its debut, many people saw its uniqueness and value, though only a few understood its potential. This is understandable, considering that its approach to text analytics is quite different from anything else out there. Also, it claims really high throughput, particularly on large corpora of text, something particularly useful for enterprises and other organizations that have lots of text data in their pipelines. Moreover, data processed by MAXset tends to be more comprehensive, so it doesn’t need some fancy data product in order to be of use to the end-user (though it lends itself to data science work too).
All this may sound nice, but does it really deliver on its promises? Here is a short video that demonstrates the core of its MVP: the transformation of a corpus into a delimited file that can be opened with any spreadsheet application. Feel free to check out the corresponding video.
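For illustration only (this is not MAXset’s actual method), the general shape of such a transformation, from a raw corpus to a spreadsheet-friendly delimited file, can be sketched like this:

```python
# Toy corpus-to-CSV transformation: each row is a document,
# each column a vocabulary word, each cell a word count.
# Any spreadsheet application can open the resulting file.

import csv
from collections import Counter

corpus = [
    "the quick brown fox",
    "the lazy dog",
]

# build a shared vocabulary across all documents
vocab = sorted({word for doc in corpus for word in doc.split()})

with open("corpus_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doc_id"] + vocab)
    for i, doc in enumerate(corpus):
        counts = Counter(doc.split())
        writer.writerow([i] + [counts.get(w, 0) for w in vocab])
```

Real text analytics products obviously go far beyond bag-of-words counts, but the end-product is the same in spirit: structured, delimited data that a non-expert can open and explore immediately.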
What I find impressive in all this is that this MVP was developed under very adverse circumstances, with a shoestring budget, minimal computing resources, and within about 6 months. It is really amazing what can be accomplished if a team is motivated by leadership and a belief in the end-result, rather than bureaucracy and a benefits package. Also, it’s companies like this that make data science truly useful, as the end-user is not some expert in data analytics, but anyone interested in getting to the gist of the data at hand, be it a program manager, a business analyst, or a client.
MAXset is still in the beginning of its cycle. As more people become aware of it and the benefits it can offer them, it is only a matter of time before it becomes a more recognized brand in the text analytics sector. Coupled with that, its integration of additional A.I. techniques in its products is bound to make it even more relevant in today’s data-driven world. For more information, feel free to contact MAXset directly at firstname.lastname@example.org.
After more than 2 weeks, I finally finished the production of the new video I was working on, about mentoring in data science. You can find it on the Safari Books Online platform. I hope you enjoy it! Feel free to contact me through this blog, with any comments on it or suggestions for other videos on data science and A.I. related topics. Thanks!
Just wanted to give you all a quick update. Lately, thanks to my marketing consultant's suggestions, I've pursued alternative (more fox-like) ways to contribute to the data science field through my writing. Clearly, the tech evangelists and the faux data scientists out there have done a lot of damage to the field by spreading unrealistic promises and unnecessary hype, so it needs all the help it can get! So, I've decided to join the Data Science Partnership team as Head of Content and publish data science and A.I. related articles there, in an attempt to reach a broader audience.
I will continue posting stuff in Foxy Data Science too, but not as frequently as before since my focus is on quality rather than quantity. Also, being in the process of writing a new book (through Technics Publications) takes up some of my time too, leaving me less time to blog. More on that in a later post...
If you are interested in checking out my new articles at DSP blog, you can find them at the corresponding site. Thank you for taking the time to read my stuff. I look forward to sharing more!
Everyone uses text in digital format, especially since the rise of social media. That’s why text has become one of the most commonly used resources in data science. In his latest book, Turning Text into Gold (Technics Publications), the father of data warehousing takes a stab at this intriguing topic. Here, we’ll take a look at his book from a couple of different angles (rather than just giving an opinionated review like pretty much everyone else does on e-commerce sites).
Before we start, let me say for the record that I don’t know Bill Inmon personally, nor has anyone asked me to review any of his books. I just find the topics he deals with quite interesting and worth learning more about, even if they don’t directly relate to my field of expertise.
In his book, Turning Text into Gold, Bill Inmon examines various topics related to text modeling and NLP. Namely, he looks at taxonomies, ontologies, databases (briefly), text data types, text analytics, two different levels of text processing, and four different use cases of text analytics across the industry. His overall style is high-level, and the book is rich in diagrams that clarify the points he makes. He also includes a number of examples in every chapter to clarify these points further. The structural complexity of the text is fairly basic, so anyone can read it, even in a busy coffee shop or while riding the bus. Make no mistake, however: the book is not targeted at novices. In fact, in order to make the most of this resource, you’ll need some basic understanding of text analytics; otherwise it is bound to appear a bit abstract.
Data Modeling Perspective
From a data architect’s viewpoint, this book covers the topic quite extensively. The author’s expertise in the field becomes abundantly clear from the get-go, as he explains the key concepts of text-related data structures in such simple terms that only a true master of the field could. Without hiding behind jargon or complex text structures, he presents the main ideas of each topic elegantly and with enough detail to make them comprehensible. It would be great if he added a few links or references for further investigation, however, as some topics are quite deep and may require more research for someone new to this field.
Data Science Perspective
From a data scientist’s perspective, this book is not very relevant, unless you are already an expert in NLP. The author doesn’t provide any guidance on how to implement any of the ideas he exhibits, nor does he hint at any particular packages / tools for applying the frameworks he describes. So, if you are a data scientist who is new to NLP and text analytics in general, you may find this book a bit too abstract. Nevertheless, if you read it in conjunction with other, more low-level books, you may find it very insightful. Also, if you are already adept in the techniques of NLP, you may find it very useful for understanding where everything fits in the bigger picture.
Much like the alchemists of our time, who aim to turn low-value data into gold, the reader can perform a similar transmutation with the text of this book. However, she may need to combine its contents with know-how from other sources for a smoother process. Nevertheless, this book is an excellent introductory resource on the field of text analytics, which has a lot to offer to data modeling and data science alike.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.