Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are largely assumption-free. This by itself enables them to tackle the problems they aim to solve in a methodical and efficient way. Of course, certain assumptions might speed things up, but they can also obstruct the discovery of the optimal solution to the problem at hand. Statistical inference models lost the predictive war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models can be combined in an ensemble setting allowed them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other statistical methods being outsourced to alternative systems is quite real.
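To make the ensemble point concrete, here is a minimal sketch using scikit-learn on synthetic data (the dataset, model choices, and numbers are purely illustrative, not a recommendation): two quite different models are combined through majority voting and scored with F1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Combine two dissimilar models via hard (majority) voting
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
])
ensemble.fit(X_tr, y_tr)

# F1 score of the combined model on the held-out set
score = f1_score(y_te, ensemble.predict(X_te))
```

Whether the ensemble actually beats its members depends on the data, of course; the sketch only shows how cheaply such a combination can be set up.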
On the other hand, statistics are very easy to use and interpret, since most of them were designed from a user’s perspective. There are doctors out there (the medical kind) who don’t know much about data analytics but can easily work a statistical model for figuring out whether a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. Such a doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field, using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
Of course, A.I. constantly evolves, so the black-box issue that makes many ANN-based systems unfavorable may wane in the future. Already there are A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn’t said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be a more novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn’t lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe-science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historical novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing whether sample A is significantly different from sample B.
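That humble sample-A-vs-sample-B question takes only a couple of lines. A minimal sketch with SciPy, using made-up samples (the data and the 0.05 threshold are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two hypothetical samples, e.g. patient outcomes under two treatments
sample_a = rng.normal(loc=5.0, scale=1.0, size=50)
sample_b = rng.normal(loc=5.8, scale=1.0, size=50)

# Welch's t-test: compares the means without assuming equal variances
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)

# Conventional decision at alpha = 0.05
significant = p_value < 0.05
```

That ease of use and interpretation is exactly what keeps such tests in the toolkit of the busy non-programmer.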
Recently someone on LinkedIn recommended that I bring more JOY to the world instead of merely complaining about it (I wasn’t complaining, but apparently she thought I was!). I’m not an entertainer, nor a psychology expert, but perhaps you don’t need to be in these lines of work in order to bring joy to the people you interact with. I thought about it and decided that perhaps data science could be a source of joy to other people. However, for this to happen, it needs first and foremost to be joyful to you.
Deriving joy from a challenging and oftentimes frustrating undertaking such as a data science project is not easy. In fact, many people can’t stand the largest part of the work such a project entails. However, with the right mindset, even the more tedious aspects of the work can be enjoyable (i.e. be conducive to joy). So, what is this mindset that turns boredom to beauty and drudgery to delight?
Although there is no magic formula for making things more enjoyable in data science, if you approach a problem with the attitude of the data science amateur, your chances of enjoying it are better. This doesn’t mean being sloppy and checking Stack Overflow or Quora every 5 minutes. The amateur’s attitude is, as the word amateur implies, an attitude based on love for what you are doing. The amateur doesn’t care if they get paid for their work. They may even never get paid, but they do it anyway because they find it fulfilling. It’s like a hobby for them.
However, a data scientist still needs to be professional about her work. There are deadlines, meetings with stakeholders, and of course debugging scripts that throw errors at the worst possible time! Handling these matters takes professionalism, but it doesn’t need to be a mechanical and draining process. If you see part of your work as a data scientist (even the debugging stage) as a learning experience and have what is known in Zen as the beginner’s mind, you are bound to find everything a bit more enjoyable. It’s the joy that comes from detachment and lack of rigid expectations from your work, something that every professional knows.
Remembering all this, especially on a Monday morning, is not as straightforward as it may seem. However, being joyful is a matter of perspective and, at the end of the day, a matter of habit. Aristotle famously said that “virtue is a matter of habit” and some could argue that joy is a kind of virtue. Maybe not something you would put on your resume or talk about in an interview, but definitely something worth keeping in mind on those long mornings when you may be tempted to question your career choices. After all, if you could be joyful about data science as a field once, you can be joyful about data science work too. And if you still feel that you need some help getting your enthusiasm flowing and invigorating a joyful mindset, you can always read my book Data Science – Mindset, Methodologies, and Misconceptions. :-)
After several days of being in limbo, the video "Remaining Relevant in Data Science" that I made recently is now online on Safari (link). If you have a subscription to that platform, do check it out. If you prefer to access this kind of knowledge through a different medium, feel free to check out the last chapter of my latest book, Data Science Mindset, Methodologies, and Misconceptions. Enjoy!
There is a lot of unstructured data out there. Many people view it as untapped potential, and they are right. There are a lot of signals out there, waiting to be harnessed by the data scientists who get to them. However, most of the data where these signals dwell is unstructured or semi-structured (there is some structure to it, but it’s not consistent). This leads some people to believe that structuring it will instantly make the data more valuable. This view is quite debatable, however, and is worth exploring further, before it brings about unrealistic expectations of what data science can do.
Structuring data is part of the data science process. Before we can feed it to a model, we need to get the data into the form of a matrix (if all the data is of the same type) or a data frame (whenever we have various types in the dataset). However, the fact that structuring data is necessary for mining the information in it (usually in the form of insights) does not make it a sufficient condition for that. In other words, we have to structure the data, but this doesn’t guarantee anything. There have been many times when, upon training various models from different frameworks, things didn’t pan out. The performance was mediocre, the results were not actionable, and the whole thing was labeled a failure of sorts. I do not mean to discourage anyone, but it’s healthy to be aware of this possibility, since it’s not often shown in data science books or tutorials. People like to talk about the success stories, leading to a false understanding and unrealistic expectations.
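As a small illustration of that structuring step, here is a sketch of turning semi-structured records into a data frame with pandas (the records and field names are made up for the example): inconsistent keys are aligned into columns, with gaps filled in automatically.

```python
import pandas as pd

# Hypothetical semi-structured records, e.g. parsed from JSON logs;
# note the inconsistent fields across records
records = [
    {"user": "a01", "age": 34, "visits": 5},
    {"user": "a02", "visits": 2},                          # missing "age"
    {"user": "a03", "age": 41, "visits": 7, "vip": True},  # extra "vip" field
]

# pandas aligns the union of keys into one data frame,
# filling the missing cells with NaN
df = pd.DataFrame(records)
```

The data is now structured and model-ready in form, but, as argued above, that alone says nothing about whether a model trained on it will be any good.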
For the data to be valuable, it needs to have a strong signal in it. This means that even by just looking at it, you can tell that there is something there that, given enough time and effort, you would be able to find yourself. In this case, data science facilitates the process of mining that signal, since no-one has the patience or the resources to go through a data stream on their own, no matter how motivated they are. Here data science is bound to be successful, since it accelerates the process of turning this information-rich data into actual information, or even knowledge. However, the structure of the data is not so relevant in this case. Even if the data is in a JSON or raw-text format, for example, it can still be useful, since it’s not too difficult to generate features that penetrate this nebulous form and manage to encapsulate the essence of it, in a form that can easily fit into a database table (albeit usually a very large one).
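As a sketch of what such feature generation might look like, a few hand-crafted features (made up here for illustration) can distill raw text into rows that fit comfortably in a database table:

```python
# Raw, unstructured text, e.g. customer feedback (illustrative examples)
texts = [
    "Great product, would buy again!",
    "Terrible. Never again.",
]

def text_features(t: str) -> dict:
    """Distill one raw text into a flat row of simple numeric features."""
    words = t.split()
    return {
        "n_chars": len(t),
        "n_words": len(words),
        "n_exclaims": t.count("!"),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

# Each row is table-ready: fixed columns, numeric values
rows = [text_features(t) for t in texts]
```

Real projects would use richer features than these, of course, but the point stands: the nebulous form of the data is no obstacle once you know what to extract from it.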
So, it is important to exercise discernment in this matter. Surely structured data may be more appealing for a data scientist, as it means less tedious work for her, but it doesn't guarantee anything of value. Besides, the process of structuring the data (aka data engineering) can be insightful too, as it involves some data exploration. Data exploration may not always accelerate the structuring of the data, but it definitely helps you understand it better and make more informed choices about the whole data science process (including structuring). After all, shortcuts in the process may save you some time, but if you know what you are doing, you can definitely do without them, saving your organization some money in the process, since automated data structuring is not free. The choice is yours.
Short answer: yes. Longer answer: definitely, as long as they make a conscious effort to cultivate the necessary parts of this mindset and integrate them into a functional whole. Easier said than done, right? Perhaps. Maybe that’s why some companies ask for someone who has 15+ years of experience in the field, even if the field didn’t exist 15 years ago! What they may really be asking for is someone who knows what this field entails and knows how to make things happen, using the corresponding methodologies. So, the question that naturally arises is “how can someone get this understanding of the field without having to spend a large part of their career in it?”
There are several strategies to accomplish that, none of which are easy or something you can learn in a bootcamp. Even really good data science courses may not be sufficient for this purpose. The reason is that the mindset of a data scientist is very diverse and not something you can put into a syllabus. There is a reason why the brightest data science practitioners seek a mentor, or some kind of personal learning experience, in order to gain some kind of mastery of the craft. Yet, as I’ve explained in the Mentoring in Data Science video, the mentor is not there to answer all your questions, even if he could answer most of them. The role of the mentor is to help you become your own mentor eventually. Of course there are exceptional people out there who don’t require a mentor, since they know everything they need to know, or they have the resources and resourcefulness to obtain this knowledge on their own. When I meet one such person I’ll be sure to blog about them!
Apart from being part of a mentorship, you can learn about the mindset of the data scientist by practicing science, in a data analytics setting. This is quite different from taking this or that tool, applying it, and then creating some insightful visuals from the results. Practicing science also involves conducting experiments, asking deep questions, and challenging yourself and what you know. It’s realizing that all scientific theories are falsifiable and not taking anything as gospel, since you are secure in the knowledge that everything in science is in flux. The only thing that’s perhaps immune to this constant change is the mindset, the essence of the role of the data scientist. One robust way to attain this understanding is to strip away all the transient aspects of the role, one by one, through scientific research. In other words, you need to become the craft, rather than merely practice it like a technician of sorts.
In my latest book I underline several aspects of the data science craft that I’ve found, through both experience and research, to be relevant and useful for bringing about the data science mindset in someone. Of course, it is next to impossible to cover all the angles in a single book, but it is a good start. Applicable to all levels of data science practitioners, this book can at the very least make you fascinated with data science and motivate you to learn more about it, without getting consumed by the techniques or the aspects of it that are more in vogue these days (e.g. artificial intelligence). After all, just like everything else in science, data science is more of a process than anything else. It’s up to you to make it an insightful and intriguing one...
Geometry is probably one of the most undervalued aspects of Mathematics. So much so, that people consider it relevant mainly for those pursuing that particular discipline, as in their minds geometry is divorced from other, more practical fields, such as data analytics. However, geometry has always been an applied discipline, intertwined with engineering. As data science, and data analytics in general, are closely linked to engineering, at least in certain principles, it makes sense to at least consider the relationship between geometry and data analytics.
Geometry involves the study and use of visual mathematical concepts, such as the line, the circle, and other curves, to solve various problems or prove relationships that may be used to solve other, more complex problems. The latter are referred to as theorems and are the core of the scientific literature of geometry. So, unlike other, more theoretical parts of mathematics, geometry is practical at its core, since it endeavors to solve real-world problems. Although the latter have become increasingly sophisticated since geometry was in its glory days (antiquity), many problems today still rely on geometry for their solution (e.g. the field of optics, the calculation of rocket trajectories, and more). Besides, since the times of Descartes, the famous philosopher-mathematician, geometry has become more quantifiable, particularly with his invention of analytic geometry.
Data analytics is in essence a field of applied mathematics, with an emphasis on numeric data, the kind that features heavily in geometry. Although direct connections between the shapes and proportions of geometry and the concepts of data analytics are few and far between, the mindset is very similar. After all, both disciplines require the practitioner to find some unknown quantity using some known data, in a methodical and logical manner. In geometry, these correspond to a particular point, shape, or mathematical relationship. In data analytics, they are variables that take the form of features (through refinement, selection, and processing in general) and target variables. Of course, data analytics (esp. data science) has a variety of tools available that facilitate all this, while in geometry it’s just the practitioner’s imagination, a pencil, some paper, and a couple of utensils. However, the mental discipline behind both fields is of the same caliber, while creativity plays an important role in both.
I’m not saying that geometry alone will make someone a good data analytics professional, or that you should give up your data science courses to take up geometry. However, if you have the time and you can see something elegant in geometry problems, then it can be a very useful pastime, much more useful than other, strictly analytical endeavors. After all, imagination hasn’t gone out of fashion, at least not in the applied sciences, so anything that can foster this faculty, while at the same time encouraging mental discipline, is bound to be helpful. As a bonus, spending time with geometry is bound to help your visualization skills and enable you to view certain data analytics problems from a different angle (no pun intended). Besides, the same mindset that helped people build pyramids and accomplish several other architectural feats is what forged many modern algorithms in machine learning, for example: turning some abstract idea or question into something concrete and measurable, be it a design or a process. Isn’t that one of the key attributes of a data analytics project?
Lately everyone likes to talk big picture when it comes to data science and artificial intelligence. I’m guilty of this too, since this kind of talk lends itself to blogging. However, it is easy to get carried away and forget that data science is a very detailed process that requires meticulous work. After all, no matter how much automation takes place, mistakes are always possible and oftentimes unavoidable. Even if programming bugs are easier to identify and even prevent, to some extent, problems may still arise, and it is the data scientist’s obligation to handle them effectively.
I’ll give an example from a recent project of mine, a PoC in the text analytics field. The idea was to develop a bunch of features from various texts and then use them to build an unsupervised learning model. Everything in the design and the core functions was smooth, even from the first draft of the code. Yet, when running one of the scripts, the computer kept running out of memory. That’s a big issue, considering that the text corpus was not huge, plus the machine used to run the programs is a pretty robust system with 16GB of RAM, running Linux (so a solid 15GB of RAM is available for the programming language to utilize as needed). Yet, the script would cause the system to slow down until it would eventually freeze (no swap partition was set up when I was installing the OS, since I didn’t expect to ever run out of memory on this machine!). Of course, the problem could be resolved by adding a swap option to the OS, but that still would not be a satisfactory solution, at least not for someone who opts for writing efficient code. After all, when building a system, it is usually built to scale well, and this prototype of mine didn’t look very scalable. So, I examined the code carefully and came up with various hacks to manage resources better. Also, I got rid of an unnecessary array that was eating up a lot of memory, and rerouted the information flow so that other arrays could be used to provide the same result. After a couple of attempts, the system was running smoothly and without using too much RAM.
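I can’t reproduce the project’s code here, but the gist of that kind of fix can be sketched in a few lines of Python (the function names and data are hypothetical): instead of materializing an intermediate array that holds everything at once, process the items lazily, one at a time, so the memory footprint stays flat.

```python
def total_chars_eager(lines):
    """Memory-hungry version: builds a full intermediate list."""
    cleaned = [line.strip().lower() for line in lines]  # whole copy in RAM
    return sum(len(line) for line in cleaned)

def total_chars_lazy(lines):
    """Memory-friendly version: a generator expression keeps
    only one line in memory at a time, yielding the same result."""
    return sum(len(line.strip().lower()) for line in lines)

# Tiny illustrative input; the technique matters at corpus scale
sample = ["  Hello  ", "World!  "]
eager_result = total_chars_eager(sample)
lazy_result = total_chars_lazy(sample)
```

On a toy input the difference is invisible, but on a large corpus the eager version’s intermediate list is exactly the kind of unnecessary array that can freeze a 16GB machine.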
It’s small details like these that make the difference between a data science system that is practical and one that is good only on the conceptual level (or one that requires a large cluster to run properly). Unfortunately, that’s something that is hard to learn through books, videos, or other educational material. Perhaps even conventional experience may not trigger this kind of lesson, though a good mentor might be very beneficial in such cases. The moral of the story for me is that we ought to continuously challenge ourselves in data science and never be content with our aptitude level. Just because something runs without errors identifiable by the language compiler doesn’t mean that it’s production-ready. Even in the case of a simple PoC, like this one, we cannot afford to lose focus. Just like the data that is constantly evolving into more and more refined information, data scientists follow a similar process, as we grow into more refined manifestations of the craft.
“Wait a minute! Isn’t data science all about cool machine learning models, number-crunching, artificial intelligence methods, and big data?” I can hear some people saying. Well, it is all that, but the one thing that binds all these different aspects of data science together is domain knowledge, or in other words, context. You may be adept in cleaning, structuring, and modeling the data at hand, but if you are missing the bigger picture and how all this data (and its distillation) relates to the stakeholders of the project, then you are just an analyst! Data science is not divorced from the real world, even if in its most esoteric aspects it may seem quite alienating to the average Joe. Data science is a business framework, among other things, and as such it constitutes an integral part of business processes. Without the latter to provide a sense of perspective and some sort of objective to the data at hand, data science is reduced to an intellectual endeavor, like modern philosophy. There is value in the latter too, but it’s not what data science is about.
Context in data science manifests on various levels. At the larger scale, it’s about its relevance to the end-user and the stakeholders of the project. As George Box would put it, no matter how brilliant a data model is, it is wrong, since it is merely an abstraction of reality; yet if the model is crafted in a way that provides value to the end-user, it can be useful. This value stems from the context it takes into account. Context also manifests in the way the data is engineered and distilled into information. For example, there are a number of ways to do dimensionality reduction (i.e., make the number of features smaller, while in some cases making these features more compact). If you follow a recipe book blindly, you’ll probably resort to PCA, ICA, or some other off-the-shelf method. However, if you look at the problem more closely, you may employ a different strategy, particularly if you have labeled data at your disposal. Such additional information may impact the way the feature data is perceived and make a feature filtering approach more relevant, for example.
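Here is a hedged sketch of that distinction, using scikit-learn on synthetic data (the dataset and the choice of two components/features are illustrative): PCA reduces dimensionality while ignoring the labels entirely, whereas a feature-filtering approach such as SelectKBest uses the labels to keep the features that actually carry the signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # 10 synthetic features
y = (X[:, 3] > 0).astype(int)        # labels driven by feature 3 only

# Unsupervised: PCA picks directions of maximal variance, blind to y
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised filtering: ANOVA F-scores rank features against the labels,
# so the label-relevant feature 3 is kept
selector = SelectKBest(f_classif, k=2).fit(X, y)
X_sel = selector.transform(X)
```

In this toy setup PCA has no reason to favor feature 3, since all features have similar variance; the label-aware filter, by contrast, zeroes in on it. That is context at work, on the feature-engineering level.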
Perhaps it would be prudent to put data science into perspective, rather than focus only on its techniques and tools. Being mindful of the context of every part of the data science pipeline is a great way to accomplish that. After all, just like every applied science, data science is geared towards people, not abstract entities that populate theories and research articles. The latter are useful, but the former are what provide our craft with meaning and business value.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.