Why the Role of A.I. in the Job Market Is Very Much a Business Decision Technical Professionals Can Contribute to
Lately there is a lot of talk about AIs potentially taking people’s jobs in the future and how this is either catastrophic, or some kind of utopia (or, less often, some other stance in between). Although we as data science and A.I. professionals have little to do with the high-level decisions that have some influence on this future, perhaps we are not so detached from the reality of the situation. I’m not talking about the A.I. choir that is happy to recite its fantasies about an A.I.-based future that is akin to the sci-fi films that monetize this idea. I’m talking about grounded professionals who have some experience in the development of A.I. systems, be it for data science or other fields of application.
The problem with business decisions is that they are by their nature related to quite complex problems. As such, it is practically impossible to solve them in a clear-cut manner that doesn’t invite reactions, or at least some debate. That’s why those individuals who have the courage to make these decisions are paid so handsomely. It’s not the time they put in, but the responsibility they undertake, that makes their role of value. However, it is important to make these decisions as future-proof as possible, something that these individuals may not be able to do on their own. That’s why they have advisors and consultants, after all. Besides, even if some of the decision-makers are technical and can understand the A.I. matters, they may lack the granularity of comprehension that an A.I. professional has.
People who make business decisions often see A.I. as a valuable resource that can help their organization in many ways (particularly by cutting down on some costs, via automation or increased efficiency in time-consuming or expensive processes). However, they may not always see the implications of these moves and the shortcomings of this still-immature technology. A.I. systems are not objective, nor immune to errors. After all, most of them are black boxes, so whatever processes they have in place for their outputs are usually beyond our reach, and oftentimes beyond our comprehension. Just like it is impossible to be sure what processes drive our decisions based on our brain patterns, it is perhaps equally challenging to pinpoint how exactly the decisions of an A.I. are forged. That’s something that is probably not properly communicated to the decision-makers on A.I. matters, along with the fact that AIs cannot undertake responsibility for these decisions, no matter how sophisticated these marvels of computing are.
Perhaps some more education and investigation into the nature of A.I. and its limitations is essential for everyone who has a say in this matter. It would be irresponsible to expect one set of people to navigate through this on their own and then blame them if their decisions are not good enough or able to withstand the test of time. This is a matter that concerns us all and as such we all need to think about it and find ways to contribute to the corresponding decisions. A.I. can be a great technology and integrate well in the job market, if we approach it responsibly and with views based on facts rather than wishful thinking.
The 5 Levels of Aptitude in Data Science – Useful Insights for both Data Scientists and People Managing Data Science Teams
Recently this article found its way to my Pocket feed. Being a 22 minute read, I was hesitant about whether I should save it since I don’t have the time for such long articles. So, I decided to save it as an insomnia remedy, since I figured that it would probably put me to sleep before finishing it. However, it had the opposite effect! What’s more, it urged me to think about how all this applies to data science and how it can help all those related to the field (be it as practitioners of data science, or people handling data science resources).
The author of the article highlights 5 categories of professionals, based on their aptitude level, the 5 levels of expertise as he calls them. I prefer to avoid the term expertise since someone can be an expert in one aspect of data science and yet not be an expert in the field overall. Aptitude sounds more appropriate but if you find that another word is more suitable to describe general competence in a field, please let me know. So, the 5 levels of aptitude are:
I would very much like to go into the details of each one of these levels, like the author of that article did, but I’d rather cover this topic in a video, if there is sufficient interest for it. One point I’d like to make, however, which may not have been conveyed clearly in the original article, is the usefulness of this classification. Whatever point you are at in your data science career, it is important to be fully aware of where you stand and what you need to do in order to better yourself. This can become apparent if you contemplate this taxonomy and are honest with yourself. If you are on the lookout for a data scientist to hire, this is useful for you too, since the role you wish to fill has more to do with a sense of aptitude and responsibility than with merely a set of skills and/or X years of experience in the field. This way, not only will your hire be a good investment as a resource, but the process may also help clarify what data science can do for your organization.
One last thing. It is important to remember that no matter what category you fall into in this taxonomy, there is always room for improvement. Even an expert has things to learn, so keep an open mind about what data science has in store for you. Let’s remember the great Renaissance master artist and sculptor Michelangelo, who said, in his advanced years, “I’m still learning...”
Lately everyone likes to talk big picture when it comes to data science and artificial intelligence. I’m guilty of this too, since this kind of talk lends itself to blogging. However, it is easy to get carried away and forget that data science is a very detailed process that requires meticulous work. After all, no matter how much automation takes place, mistakes are always possible and oftentimes unavoidable. Even if programming bugs are easier to identify and even prevent, to some extent, some problems may still arise and it is the data scientist’s obligation to handle them effectively.
I’ll give an example from a recent project of mine, a PoC in the text analytics field. The idea was to develop a bunch of features from various texts and then use them to build an unsupervised learning model. Everything in the design and the core functions was smooth, even from the first draft of the code. Yet, when running one of the scripts, the computer kept running out of memory. That’s a big issue, considering that the text corpus was not huge, and that the machine used to run the programs is a pretty robust system, with 16GB of RAM, running Linux (so a solid 15GB of RAM is available to the programming language to utilize as needed). Yet, the script would cause the system to slow down until it eventually froze (no swap partition was set up when I was installing the OS, since I didn’t expect to ever run out of memory on this machine!). Of course, the problem could be resolved by adding swap space to the OS, but that still would not be a satisfactory solution, at least not for someone who opts for writing efficient code. After all, a system is usually built to scale well, and this prototype of mine didn’t look very scalable. So, I examined the code carefully and came up with various hacks to manage resources better. Also, I got rid of an unnecessary array that was eating up a lot of memory, and rerouted the information flow so that other arrays could be used to provide the same result. After a couple of attempts, the system was running smoothly and without using too much RAM.
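The script from that project is not shown here, so what follows is only a minimal Python sketch of the general idea, under my own assumptions: instead of keeping every intermediate array in memory, each document is turned into a handful of features and then discarded. The function names (tokenize, doc_features, running_means) and the three features are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def tokenize(text: str) -> List[str]:
    # Very naive tokenizer (an assumption): lowercase, split on whitespace,
    # keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def doc_features(docs: Iterable[str]) -> Iterator[Dict[str, float]]:
    # Generator: yields one small feature dict per document, so the full
    # token lists of the corpus are never held in memory at the same time.
    for text in docs:
        tokens = tokenize(text)
        counts = Counter(tokens)
        n = len(tokens)
        yield {
            "n_tokens": float(n),
            "n_unique": float(len(counts)),
            "type_token_ratio": len(counts) / n if n else 0.0,
        }


def running_means(rows: Iterable[Dict[str, float]]) -> Dict[str, float]:
    # Aggregates the stream with running totals instead of collecting
    # all rows into one large array first.
    totals: Dict[str, float] = {}
    count = 0
    for row in rows:
        count += 1
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / count for name, total in totals.items()} if count else {}


corpus = ["the cat sat", "the the dog"]
means = running_means(doc_features(corpus))
```

In a real pipeline the corpus would be an iterator over files rather than a list, which is what keeps memory use roughly constant as the corpus grows.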
It’s small details like these that make the difference between a data science system that is practical and one that is good only on the conceptual level (or one that requires a large cluster to run properly). Unfortunately, that’s something that is hard to learn through books, videos, or other educational material. Perhaps even conventional experience may not trigger this kind of lesson, though a good mentor might be very beneficial in such cases. The moral of the story for me is that we ought to continuously challenge ourselves in data science and never be content with our aptitude level. Just because something runs without errors identifiable by the language compiler doesn’t mean that it’s production-ready. Even in the case of a simple PoC, like this one, we cannot afford to lose focus. Just like the data that is constantly evolving into more and more refined information, data scientists follow a similar process, as we grow into more refined manifestations of the craft.
Kaggle is an online community (social medium?) geared towards data science enthusiasts. It aims to link data analytics problems with data analysts who have a flair for machine learning and other data science methodologies. I’ve even mentioned it in my books and videos at times, as a potential resource for gaining some traction with the data modeling aspect of the field and for getting people more interested in the craft. However, some people confuse Kaggle experience with data science experience. Let’s delve more into this matter.
Kaggle problems tend to be geared towards analysts, not data scientists. The whole nature of a competition is one-sided (aiming to optimize a particular evaluation metric), which is not the same as creating a successful data science product. Although the model in the back-end of a data science product has to be somewhat accurate, there are other factors involved that are equally important (if not more important) than the model’s raw performance. For example, the amount of resources it uses, its interpretability, how easy it is to use, its maintenance costs, and how compatible it is with existing technologies, are all factors that are relevant when building a data science model.
What’s more, the data in Kaggle competitions is very much like baby formula. In the real world, data is far more complex, noisier, and stems from a variety of sources. So, if someone can handle Kaggle data, that’s great, but it’s not the same as handling real-world data, no matter how robust their models are. In fact, many people argue that around 80% of the data scientist’s work, time-wise, is getting the data clean and ready for the data model.
Of course, it’s not just Kaggle that provides this false image of data science; there are other platforms that are similar. The UCI repository is one example, though in all fairness, they state that their datasets are there for research purposes. Still, they lend themselves to practicing data modeling and trying out heuristics and new data analytics or A.I. algorithms.
So, when it comes to data science experience, it is important to remember that data modeling is just one part of it. It is an important part, but not the whole picture. If you want to gain real-world experience, you are better off getting some data from real-world sources, such as social media feeds, sensor data from IoT devices, etc. rather than Kaggle competitions. The latter are fun, but data science is not all fun and games. Much like hackathons are great for practicing coding, when it comes to doing programming in the real world, you need more than just hackathon experience. So, why would data science be any different?
Short answer: no. Long answer: although some experience is positively correlated with aptitude in the field, the relationship between the two is neither linear, nor straightforward. Let’s delve into this more, examining the lesser known aspects of it.
If someone is at the beginning of their career as a data scientist, chances are that having some experience is much better than no experience at all. The experience in this case involves dealing with practical challenges that are usually not described in data science books or courses, so for the inexperienced data scientist, these can be major liabilities in their work. The experienced data scientist has encountered tricky situations where the models they have built have failed, and so has a better chance of avoiding similar situations, or at the very least tackling them efficiently when they occur. Do additional years of experience help a data scientist in their career, though? It depends. Unmistakably, that additional experience of working in an organization allows the professional to cultivate their soft skills more and be able to work more effectively in a team. Also, their understanding of how a business works becomes more solid and functional. However, data science aptitude does not necessarily grow as the years of experience accumulate. After all, the field changes so rapidly that having a few more years of experience in it may be irrelevant, as the techniques the more experienced data scientist has mastered may not be so useful or necessary any more.
Of course there are exceptions. If a data scientist is particularly good, due to talent, education, or some combination of the two, then the additional years of experience are going to translate into a more varied expertise and perhaps the ability to lead a team effectively. The thing is that this kind of person is going to be good even with little or no experience, since the innate talent or general aptitude due to good education are there from the get-go. Naturally these cases are few and may be considered outliers, but they are relevant enough to be valuable as they are the exception that proves the rule.
So, what would be a good proxy for data science aptitude then, if experience is not a good enough feature to predict this valuable variable? Well, it depends on the situation. If you have an organization that deals with text a lot and requires a data scientist to be part of NLP and NLU projects, then some understanding of the language(s) and/or the ability to create and implement scalable heuristics based on text data would be very valuable. These skills would be a better proxy than having spent a number of years in the field, focusing mainly on recommender systems, for example. If an organization wants someone to work on image data and solve challenging problems related to that (e.g. object identification), then a solid understanding of image data or of deep learning techniques would be a pretty good proxy of aptitude related to this task.
Work experience has remained relevant because of its applicability in various professions. However, making the inference that just because it works well with them it should also work in data science is unscientific and reckless, at best. So perhaps organizations that value experience so much are better off being avoided since it’s doubtful that they have a solid understanding of data science, or the ability to manage this kind of resource effectively (perhaps their managers need to gain some more experience in handling certain human resources, who knows?). After all, just because most organizations can benefit from data scientists, it doesn’t mean that they are data science ready.
“Wait a minute! Isn’t data science all about cool machine learning models, number-crunching, artificial intelligence methods, and big data?” I can hear some people saying. Well, it is all that, but the one thing that binds all these different aspects of data science together is domain knowledge, or in other words, context. You may be adept in cleaning, structuring, and modeling the data at hand but if you are missing the bigger picture and how all this data (and its distillation) relates to the stakeholders of the project, then you are just an analyst! Data science is not divorced from the real world, even if in its most esoteric aspects, it may seem quite alienating to the average Joe. Data science is a business framework, among other things, and as such it constitutes an integral part of business processes. Without the latter to provide a sense of perspective and some sort of objective to the data at hand, data science is reduced to an intellectual endeavor, like modern philosophy. There is value in the latter too, but it’s not what data science is about.
Context in data science manifests on various levels. At the larger scale, it’s about its relevance to the end-user and the stakeholders of the project. No matter how brilliant a data model is, it is wrong, as it is merely an abstraction of reality. Yet if the model is crafted so that it provides value to the end-user, it can be useful (as George Box would put it). This value stems from the context it takes into account. Context also manifests in the way the data is engineered and distilled into information. For example, there are a number of ways to do dimensionality reduction (i.e., make the number of features smaller, while in some cases making these features more compact). If you follow a recipe book blindly, you’ll probably resort to PCA, ICA, or some other off-the-shelf method. However, if you look at the problem more closely, you may employ a different strategy, particularly if you have label data at your disposal. Such additional information may impact the way the feature data is perceived and make a feature filtering approach more relevant, for example.
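To make the feature filtering idea a bit more concrete, here is a minimal Python sketch. It is just one possible filter, assumed for illustration: it ranks features by the absolute Pearson correlation of each one with the labels and keeps the top k. The names pearson and filter_features, as well as the toy data, are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Plain Pearson correlation; returns 0.0 when either input is constant.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_features(
    features: Dict[str, Sequence[float]],
    labels: Sequence[float],
    k: int,
) -> List[str]:
    # Scores every feature against the labels and keeps the k best ones.
    # Unlike PCA, this uses the label information and keeps the original
    # features intact, which helps interpretability.
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


# Toy data (made up): one informative feature, one uninformative, one constant.
features = {
    "signal": [0.0, 1.0, 2.0, 3.0],
    "noise": [1.0, -1.0, 1.0, -1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
}
labels = [0.0, 0.0, 1.0, 1.0]
selected = filter_features(features, labels, k=1)
```

Correlation only captures linear relationships, so in practice a different scoring function may fit the problem better; the point is merely that the labels, i.e. the context, drive which features survive.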
Perhaps it would be prudent to put data science into perspective, rather than focus on its techniques and tools only. Being mindful of the context of every part of the data science pipeline is a great way to accomplish that. After all, just like every applied science, data science is geared towards people, not abstract entities to populate theories and research articles. The latter are useful, but the former are what provide our craft with meaning and business value.
A.I. is great, especially when applied to data science. Many people lately are quite concerned about the various dangers it may entail. This naturally polarizes people, splitting views of the topic into two main groups: the ones neglecting these concerns and those mirroring a fear that the end of the world is upon us. Probably the truth lies somewhere in-between, but given the lack of evidence, any speculation on the matter may be premature and likely to be inaccurate.
In this post I’d like to focus on another danger that many people don’t think much about, or don’t see it as a danger at all: the sense of complacency that may arise from a super-automated world. Of course, complacency is a human condition and has little to do with A.I. but someone may consider that it is A.I. to blame for this condition. After all, super-automation may be possible only through this new technology becoming wide-spread.
This danger, which can find its way to data science too if left unchecked, is a real one. However, it is neither singular nor catastrophic. After all, every large-scale technological innovation has brought about social changes that have triggered this condition to some extent. This does not mean that we should go back to the stone age, however. After all, technology is largely neutral and the people who make it available to the world have the best intentions in mind. So, it seems that blaming a new tech for this matter may be a bit irresponsible.
Yet, the advent of technology can be a good thing if dealt with in a mature manner. Just like you can own a car and still make time for physical exercise, you can have access to an A.I. and still be a creative and productive person. It’s all a matter of power, at the end of the day. If we give away our power, our ability to choose and to shape our lives, then we are left powerless victims of whoever has taken hold of that power. In the case of A.I., if we cherish automation so much that we outsource every task to it, then we are willingly creating our own peril. So, if we choose to maintain a presence in all processes where A.I. is involved, the latter is not going to be a threat, not a considerable one anyway.
There is no doubt that A.I. can be dangerous, much like every other technological advancement. However, it seems that the crux of the problem lies within us, rather than in the machines that incarnate this technology. If we give in to a sense of complacency and allow the AIs to have a gradually more active part in our society, then maybe this tech will create more problems than the ones it’ll solve. However, if we deal with this new technological advent maturely, we can still benefit from it, without making ourselves obsolete or irrelevant in the process.
When people think about the benefits of A.I. and its impact in our world, they usually think of self-driving cars, advanced automations, deep learning systems, clever chatbots, etc. Those particularly infatuated with the idea of A.I. tend to go even further and fantasize about super-intelligent machines that will magically solve all our problems without any effort from us (pretty much like a deus ex machina figure in some ancient theater play). However, the more pragmatic A.I. thinkers focus more on particular applications of A.I. that can be implemented fairly easily, and that target specific issues that would be impractical to solve in conventional ways. One such case is that of detecting how contaminated beehives are by a particular parasite.
Why should we care about this matter? Don’t we have larger problems to deal with? Perhaps. After all, there are more evident problems out there that require unconventional ways of tackling them, problems that could benefit a lot from a narrow A.I. designed for them. However, the issue of infested beehives is not a minor one, as it represents a real danger for the whole species of these buzzing insects. It’s worth noting that bees are not useful for just the honey they produce; they are key in plant pollination, and as such they play an important role in our planet’s fragile ecosystem, which has been on the wane lately. So, it may be a big deal after all.
Developing an A.I. to tackle the beehive infestation problem is a project whose effort is small compared to its impact, as it is fairly manageable with the existing technology, at least for a particular parasite, called the Varroa mite. These organisms can cause serious issues to the bees, issues that are observable with the naked eye. However, assessing the infestation may not be so straightforward, making it difficult to take intelligent action against it (e.g. how can you tell which beehives are in imminent danger and prioritize accordingly?). That’s where Computer Vision comes in handy, an automated way for a computer system to evaluate what a camera attached to it observes. The images from the camera feed, when coupled with some deep learning network, can help measure the magnitude of the issue in a very small amount of time (check out a demo of an app by TopLab that does just that). Will this be enough? Possibly, if this process is coupled with an effort to eliminate the parasites once identified. However, knowing about the infestation issue in an objective and practical manner can definitely speed things up.
Perhaps A.I. is not as futuristic as it is often perceived, nor as high-level as it comes across. After all, just like any other applied science, it aims to solve real-world problems right here and now, in an efficient and effective manner. The question is, are we willing to apply it to more strategic problems, like the case of an impaired ecosystem, or are we going to use it only to make our urban lives more convenient? Hopefully that’s a question we can answer with just our natural intelligence...
Sentiment Analysis is a popular NLP topic that I’ve been involved in for a while now. I even wrote an article about it for a friend of mine, who is an editor at a marketing blog. Anyway, after I finally finished my latest book (Technics Publications, ETA: Fall 2017), I had some time to work on a video for Safari Books Online. This video is now online at Safari and is probably going to be followed by similar ones on NLP and NLU related topics. Any suggestions are welcome!
When MAXset made its debut many people saw its uniqueness and value, though only a few understood its potential. This is understandable, considering that its approach to text analytics is quite different from anything else that is out there. Also, it claims really high throughput, particularly on large corpora of text, something particularly useful for enterprises and other organizations that have lots of text data in their pipelines. Moreover, data processed by MAXset tends to be more comprehensive, so it doesn’t need some fancy data product in order to be of use to the end-user (though it lends itself to data science work too).
All this may sound nice, but does it really deliver on its promises? There is a short video that demonstrates the core of its MVP, the transformation of a corpus into a delimited file that can be opened with any spreadsheet application. Feel free to check it out.
What I find impressive in all this is that this MVP was developed under very adverse circumstances, with a shoestring budget, minimal computing resources, and within about 6 months. It is really amazing what can be accomplished if a team is motivated by leadership and a belief in the end-result, rather than bureaucracy and a benefits package. Also, it’s companies like this that make data science truly useful, as the end-user is not some expert in data analytics, but anyone interested in getting to the gist of the data at hand, be it a program manager, a business analyst, or a client.
MAXset is still at the beginning of its cycle. As more people become aware of it and the benefits it can offer them, it is only a matter of time before it becomes a more recognized brand in the text analytics sector. Coupled with that, its integration of additional A.I. techniques in its products is bound to make it even more relevant in today’s data-driven world. For more information, feel free to contact MAXset directly at firstname.lastname@example.org.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.