“I have never let my schooling interfere with my education.” (a quote commonly attributed to Mark Twain)
People talk about education a lot these days, particularly in a data science setting. However, we need to distinguish between actual education and training. Both are essential, but it is the former that holds the most value. The latter is easier and oftentimes faster, but it may not be a good investment of your time if it is not accompanied by the former.
Education is all about mindset development and the ability to be inspired by knowledge, thereby developing a healthy yearning for it. It is what happens when you teach a child how to play a game or do a specific task. Although it’s more of a state of mind than anything else, education also has a formal aspect, related to courses, seminars, workshops, and talks geared towards enhancing one’s understanding of the topic at hand.
Training, on the other hand, is geared more towards techniques, methods, and the technical details of the topic taught. This is useful, of course, since every data scientist needs to know all these things. That’s why there are so many data science books and videos out there! However, knowing how to build an SVM or a neural network doesn’t make someone a competent data scientist. In fact, in some cases it doesn’t even make him an employable one.
Perhaps there is a reason why most companies require X years of experience of their recruits. Some things in data science you can only learn over time, by practicing them and by developing an intuition for the data and how it is processed. Although the idea that a data scientist has to have X years of experience to be worthy remains debatable (why X and not Y?), this trend shows that hiring managers can spot the difference between someone who knows data science from a book (or videos) and someone who knows the craft because she has worked the data and has developed a bunch of models, through lots of trials and the inevitable mistakes that ensue.
Education is therefore something that can be attained through experience, not just by reading and watching data science material on the Safari platform. The latter can be a great start, but you still need to get your hands dirty and also think about the whole thing, instead of just following recipes from a data science cookbook. It’s important to know techniques, no doubt, but unless you have developed an understanding that allows you to go beyond these techniques and explore alternative features and alternative models, you may never grow beyond the advanced beginner stage.
Even someone who has spent most of his life in data science can still learn about this field, as it's a) very diverse and widespread, and b) always evolving. Personally, I still find that I’m learning new things as I delve deeper into the field and as I converse with other data scientists and A.I. professionals of all levels. This too can be a form of education, no less valuable than the education that comes from creating a new data analytics method or a new data product. The moment someone starts looking down on education and thinks that he knows “enough” is the moment he begins to become obsolete.
Sometimes it’s easy to get carried away and focus on data science too much, losing sight of its applications. Although this is fairly common in an academic setting (particularly in universities that don’t have any ties to the industry), it may happen in companies too. When this happens, it’s usually best to walk away, since data science without any real-world application can be problematic.
Data science, and A.I. geared towards data analytics, involve a lot of scientific methodologies, which are quite interesting on their own. This may tempt someone to get lost in that aspect of the craft and neglect the application part, particularly the one where these methodologies are employed for solving real-world problems. That’s not to say that doing data science research is bad. Quite the contrary. However, when the research has no application and focuses too much on the math side of things, it is bound to be a waste of resources (unless you are doing this as part of a research project, e.g. for a research center or a university, in which case this is expected). The reason is that data science is by definition an applied field, much like engineering. Particularly when it is undertaken by a company (e.g. a startup), it needs to deliver something concrete and, more importantly, something useful.
It’s hard to overestimate the value of the aspect of data science that has to do with the end-user. After all, this person is often the one paying the bills! Also, focusing on the application part of the craft enables something else too: the more practical implementation of the technologies developed and the inception of new methods that are more hands-on and therefore useful. This is one of the reasons that data science has veered away from Statistics, a field which is by nature more theoretical and math-heavy than applied. That’s also the main reason why data science involves a lot of programming, oftentimes building things from scratch, even if it’s just simple scripts. That’s quite different from using an all-in-one software package, like SAS or SPSS, where the user merely calls functions and does rudimentary data processing.
You can come up with ingenious methods in data science that could fetch a journal publication or two. However, if these methods don’t add value to an organization, they are not that great from a holistic standpoint. This is observed in other parts of Science too, e.g. Electromagnetism. Despite the various theoretical aspects of that field, its usefulness is also apparent. People who practice this part of Physics tend to be very practical and oftentimes come up with interesting inventions that add value to their users (e.g. electromagnets or power transformers). Data science is not any different.
All the clever mathematics behind a method may be enchanting for the mind, but it’s when this method is put into practice and yields actionable insight that it really becomes meaningful. That’s something worth remembering, since it’s easy to lose sight of the questions we are trying to answer and focus too much on the possibilities we discover. Some may argue that it’s the journey that matters, but for a journey to be a journey there needs to be a destination. The latter is usually some person who doesn't care much about the science behind the insights, but rather about their applicability and usefulness. Companies like MAXset LLC may be completely oblivious to that, but this doesn't make it a viable strategy. On the other hand, companies that have a chance of providing true value to the world make the business aspect of the craft their priority.
It is easy to fall into the misconception of believing that in data science we are all solitary people, doing our work and interacting only in the workplace and on social media. Perhaps we are part of some data science team, but we still feel we are on our own when it comes to our relationship with the field. However, this is just one of many possibilities for how we relate to the data science world, and it is definitely not the best one.
Being part of a community in data science is not only possible but also necessary. Of course just networking with other data scientists may not be enough, but it is often a good starting point. This is particularly important towards the beginning of one’s career. After all, not even the best data science books can give someone solace in times of difficulty or doubt. That’s when having a good mentor comes in very handy. After all, even if that mentor is a bit aloof and preoccupied with his own stuff, he tends to have a genuine interest in your career and is motivated to help you out, at least to some extent. This can be another step towards becoming part of a community of data science professionals.
Make no mistake, however. Neither the mentor nor anyone else is going to fight your battles for you. The other data scientists, be they professional acquaintances, mentors, or teammates, have their own battles to tackle. However, they may be able to offer you advice or help you gain insight into solutions that you couldn't think of by yourself, especially while you are immersed in the problems you are tackling.
Finding a physical community may not always be possible. Not all cities are as advanced as the ones where the field thrives and which are bustling with data science events and activities. However, there are data scientists out there who are also in need of a community, so it’s only a matter of time before you find them. Perhaps you’ll “meet” them online, through some social network or a data science forum. Maybe you’ll encounter them at a data science conference, or a webinar. Bottom line, if you are open to finding a community of data scientists, the opportunities to do so will manifest, sooner or later.
Being part of a data science community is not only about helping you in difficult times, though. It’s also a great accelerator for developing yourself as a data scientist, through exposure to new trends, novel approaches to known problems, and most importantly, to unknown problems that you’d probably not encounter on your own, even if you work in a data-driven company. All that is bound to foster in you the knowledge and know-how you need to advance to the next level, whatever that level is for you. At the same time, it can help you maintain your enthusiasm for data science, and perhaps even make you more zestful about the field. After all, it is usually the people who are passionate about something that make the most progress in it and are also consistent in doing so. Data science is not any different in that respect.
Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are largely assumption-free. This by itself enables them to tackle the problems they are aiming to solve in a very methodical and efficient way. Of course, certain assumptions might speed things up, but they might also obstruct the discovery of the optimal solutions to the problem at hand. Statistical inference models lost the war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models could be combined in an ensemble setting allowed them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other statistical methods being superseded by alternative systems is quite real.
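To make that kind of comparison concrete, here is a minimal sketch, using scikit-learn and synthetic data (none of it tied to any particular study), that contrasts a single linear model with a tree-based ensemble on the F1 metric; the numbers it prints are illustrative, not a benchmark:

```python
# A minimal sketch comparing a single linear classifier with an ensemble on F1;
# the data is synthetic and the scores are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest ensemble", RandomForestClassifier(n_estimators=200, random_state=42)),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name, "F1:", round(f1_score(y_test, model.predict(X_test)), 3))
```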
On the other hand, statistical methods are very easy to use and interpret, since most of them were designed from a user’s perspective. There are doctors out there (the medical kind) who don’t know much about data analytics but can easily work a statistical model to figure out if a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. That doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
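The kind of analysis that doctor relies on often boils down to a handful of lines. Here is a hedged sketch of it (the measurements are invented for illustration, and scipy is just one of several tools that could do this):

```python
# A toy example of checking whether a treated group differs from a control group
# using a two-sample t-test; the measurements below are made up for illustration.
from scipy import stats

control = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2]   # some health metric, untreated patients
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2]   # the same metric after taking the drug

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at alpha = 0.05.")
```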
Of course, A.I. constantly evolves, so the black-box issue that makes many ANN-based systems unfavorable may wane in the future. There are already A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn't said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be a more novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn't lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historical novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing whether sample A is significantly different from sample B.
Recently someone on LI recommended that I bring more JOY to the world instead of merely complaining about it (I wasn’t complaining, but apparently she thought I was!). I’m not an entertainer, nor a psychology expert, but perhaps you don’t need to be in these lines of work in order to bring joy to the people you interact with. I thought about it and decided that perhaps data science could be a source of joy to other people. However, for this to happen, it needs first and foremost to be joyful to you.
Deriving joy from a challenging and oftentimes frustrating process such as a data science project is not easy. In fact, many people can’t stand the largest part of the work such a project entails. However, with the right mindset, even the more tedious aspects of the work can be enjoyable (i.e. conducive to joy). So, what is this mindset that turns boredom to beauty and drudgery to delight?
Although there is no magic formula for making things more enjoyable in data science, if you have the attitude of the data science amateur when you approach a problem, your chances of enjoying it are better. This doesn’t mean being sloppy and checking Stackoverflow or Quora every 5 minutes. The amateur’s attitude is, as the word amateur implies, an attitude based on love for what you are doing. The amateur doesn’t care if they get paid for their work. They may even never get paid, but they do it anyway because they find it fulfilling. It’s like a hobby for them.
However, a data scientist still needs to be professional about her work. There are deadlines, meetings with stakeholders, and of course debugging scripts that throw errors at the worst possible time! Handling these matters takes professionalism, but it doesn’t need to be a mechanical and draining process. If you see part of your work as a data scientist (even the debugging stage) as a learning experience and have what is known in Zen as the beginner’s mind, you are bound to find everything a bit more enjoyable. It’s the joy that comes from detachment and lack of rigid expectations from your work, something that every professional knows.
Remembering all this, especially on a Monday morning, is not as straightforward as it may seem when you think of it. However, being joyful is a matter of perspective and, at the end of the day, a matter of habit. Aristotle famously said that “virtue is a matter of habit” and some could argue that joy is a kind of virtue. Maybe not something you would put on your resume or talk about in an interview, but definitely something worth keeping in mind in those long mornings when you may be tempted to question your career choices. After all, if you could be joyful about data science as a field once, you can be joyful about data science work too. And if you still feel that you need some help to get your enthusiasm flowing and rekindle a joyful mindset, you can always read my book Data Science – Mindset, Methodologies, and Misconceptions. :-)
After several days in limbo, the video "Remaining Relevant in Data Science" that I made recently is now online on Safari (link). If you have a subscription to that platform, do check it out. If you prefer to access this kind of knowledge through a different medium, feel free to check out the last chapter of my latest book, Data Science – Mindset, Methodologies, and Misconceptions. Enjoy!
There is a lot of unstructured data out there. Many people view it as untapped potential, and they are right. There are a lot of signals out there, waiting to be harnessed by the data scientists who get to them. However, most of the data where these signals dwell is unstructured or semi-structured (there is some structure to it, but it’s not consistent). This leads some people to believe that structuring it will instantly make the data more valuable. This view is quite debatable, however, and is worth exploring further, before it brings about unrealistic expectations of what data science can do.
Structuring data is part of the data science process. Before we can feed the data to a model, we need to get it into the form of a matrix (if all the data is of the same type) or a data frame (whenever we have various types in the dataset). However, the fact that structuring data is necessary for mining the information in it (usually in the form of insights) does not make it a sufficient condition for that. In other words, we have to structure the data, but this doesn't guarantee anything. There have been many times when, upon training various models from different frameworks, things don’t seem to pan out. The performance is mediocre, the results are not actionable, and the whole thing is labeled a failure of sorts. I do not mean to dismay anyone, but it’s healthy to be aware of this possibility, since it’s not often shown in data science books or tutorials. People like to talk about the success stories, leading to a false understanding and unrealistic expectations.
For the data to be valuable, it needs to have a strong signal in it. This means that even by just looking at it, you can tell that there is something there that, given enough time and effort, you would be able to find yourself. In this case, data science facilitates the process of mining that signal, since no one has the patience or the resources to go through a data stream on their own, no matter how motivated they are. In this case, data science is bound to be successful, since it accelerates the process of turning this information-rich data into actual information, or even knowledge. However, the structure of the data is not so relevant here. Even if the data is in a JSON or raw text format, for example, it can still be useful, since it’s not too difficult to generate features that penetrate this nebulous form and manage to encapsulate the essence of it, in a form that can easily fit into a database table (albeit a very large one, usually).
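As an illustration of how little the original format matters, here is a rough sketch (the field names and the features themselves are hypothetical) of turning a few semi-structured JSON records into a structured table of simple features that a model could ingest:

```python
# A rough sketch of deriving simple features from semi-structured records;
# the field names and the features are hypothetical, chosen for illustration.
import json
import pandas as pd

raw_records = [
    '{"id": 1, "text": "Great product, would buy again!", "tags": ["review"]}',
    '{"id": 2, "text": "Shipping was slow and the box arrived damaged."}',
    '{"id": 3, "text": "Average experience overall.", "tags": ["review", "neutral"]}',
]

rows = []
for line in raw_records:
    rec = json.loads(line)                    # parse the semi-structured record
    text = rec.get("text", "")
    rows.append({
        "id": rec.get("id"),
        "n_chars": len(text),                 # a couple of crude text features
        "n_tokens": len(text.split()),
        "n_tags": len(rec.get("tags", [])),   # missing fields handled gracefully
    })

df = pd.DataFrame(rows)                       # the structured form a model can use
print(df)
```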
So, it is important to exercise discernment in this matter. Surely structured data may be more appealing for a data scientist, as it means less tedious work for her, but it doesn't guarantee anything of value. Besides, the process of structuring the data (aka data engineering) can be insightful too, as it involves some data exploration. Data exploration may not always accelerate the structuring of the data, but it definitely helps you understand it better and make more informed choices about the whole data science process (including structuring). After all, shortcuts in the process may save you some time, but if you know what you are doing, you can definitely do without them, saving your organization some money in the process, since automated data structuring is not free. The choice is yours.
Short answer: yes. Longer answer: definitely, as long as they make a conscious effort to cultivate the necessary parts of this mindset and integrate them into a functional whole. Easier said than done, right? Perhaps. Maybe that’s why some companies ask for someone who has 15+ years of experience in the field, even if the field didn't exist 15 years ago! What they may really be asking for is someone who knows what this field entails and knows how to make things happen, using the corresponding methodologies. So, the question that naturally arises is “how can someone get this understanding of the field without having to spend a large part of their career in it?”
There are several strategies to accomplish that, none of which are easy or something that you can learn in a bootcamp. Even really good data science courses may not be sufficient for this purpose. The reason is that the mindset of a data scientist is very diverse and not something you can put into a syllabus. There is a reason why the brightest data science practitioners seek a mentor, or some kind of personal learning experience, in order to gain some mastery of the craft. Yet, as I’ve explained in the Mentoring in Data Science video, the mentor is not there to answer all your questions, even if he could answer most of them. The role of the mentor is to help you become your own mentor eventually. Of course, there are exceptional people out there who don’t require a mentor, since they know everything they need to know, or they have the resources and resourcefulness to obtain this knowledge on their own. When I meet one such person I’ll be sure to blog about them!
Apart from being part of a mentorship, you can learn about the mindset of the data scientist by practicing science in a data analytics setting. This is quite different from taking this or that tool, applying it, and then creating some insightful visuals from the results. Practicing science also involves conducting experiments, asking deep questions, and challenging yourself and what you know. It’s realizing that all scientific theories are falsifiable and not taking anything as gospel, since you are secure in the knowledge that everything in science is in flux. The only thing that’s perhaps immune to this constant change is the mindset, the essence of the role of the data scientist. One robust way to attain this understanding is to strip away all the transient aspects of the role, one by one, through scientific research. In other words, you need to become the craft, rather than merely practice it like a technician of sorts.
In my latest book I underline several aspects of the data science craft that I’ve found, through both experience and research, to be relevant and useful for bringing about the data science mindset in someone. Of course, it is next to impossible to cover all the angles in a single book, but it is a good start. Applicable to all levels of data science practitioners, this book can at the very least make you fascinated with data science and motivate you to learn more about it, without getting consumed by the techniques or the aspects of it that are more in vogue these days (e.g. artificial intelligence). After all, just like everything else in science, data science is more of a process than anything else. It’s up to you to make it an insightful and intriguing one...
Geometry is probably one of the most undervalued areas of Mathematics. So much so that people consider it relevant mainly for those pursuing that particular discipline, as in their minds geometry is divorced from other, more practical fields, such as data analytics. However, geometry has always been an applied discipline, intertwined with engineering. As data science (and data analytics in general) is closely linked to engineering, at least in certain principles, it makes sense to at least consider the relationship between geometry and data analytics.
Geometry involves the study and use of visual mathematical concepts, such as the line, the circle, and other curves, to solve various problems or prove relationships that may be used to solve other, more complex problems. The latter are referred to as theorems and are the core of the scientific literature of geometry. So, unlike other, more theoretical parts of mathematics, geometry is practical at its core, since it endeavors to solve real-world problems. Although the latter have become increasingly sophisticated since geometry’s glory days (antiquity), many problems today still rely on geometry for their solution (e.g. in the field of optics, the calculation of rocket trajectories, and more). Besides, since the times of Descartes, the famous philosopher-mathematician, geometry has become more quantifiable, particularly with his invention of analytic geometry.
Data analytics is in essence a field of applied mathematics, with an emphasis on numeric data, the kind that features heavily in geometry. Although direct connections between the shapes and proportions of geometry and the concepts of data analytics are few and far between, the mindset is very similar. After all, both disciplines require the practitioner to find some unknown quantity using some known data, in a methodical and logical manner. In geometry, these correspond to a particular point, shape, or mathematical relationship. In data analytics, they are variables that take the form of features (through refinement, selection, and processing in general) and target variables. Of course, data analytics (esp. data science) has a variety of tools available that facilitate all this, while in geometry it’s just the practitioner’s imagination, a pencil, some paper, and a couple of drawing instruments. However, the mental discipline behind both fields is of the same caliber, while creativity plays an important role in both.
I’m not saying that geometry alone will make someone a good data analytics professional, or that you should give up your data science courses to take up geometry. However, if you have the time and you can also see something elegant in geometry problems, then it can be a very useful pastime, much more useful than other, strictly analytical endeavors. After all, imagination hasn't gone out of fashion, at least not in the applied sciences, so anything that can foster this faculty, while at the same time encouraging mental discipline, is bound to be helpful. As a bonus, spending time with geometry is bound to help your visualization skills and enable you to view certain data analytics problems from a different angle (no pun intended). Besides, the same mindset that helped people build pyramids and accomplish several other architectural feats is what forged many modern algorithms in machine learning, for example, turning some abstract idea or question into something concrete and measurable, be it a design or a process. Isn't that one of the key attributes of a data analytics project?
Lately everyone likes to talk big picture when it comes to data science and artificial intelligence. I’m guilty of this too, since this kind of talk lends itself to blogging. However, it is easy to get carried away and forget that data science is a very detailed process that requires meticulous work. After all, no matter how much automation takes place, mistakes are always possible and oftentimes unavoidable. Even if programming bugs are easier to identify and, to some extent, even prevent, some problems may still arise, and it is the data scientist’s obligation to handle them effectively.
I’ll give an example from a recent project of mine, a PoC in the text analytics field. The idea was to develop a bunch of features from various texts and then use them to build an unsupervised learning model. Everything in the design and the core functions was smooth, even from the first draft of the code. Yet, when running one of the scripts, the computer kept running out of memory. That’s a big issue, considering that the text corpus was not huge, plus the machine used to run the programs is a pretty robust system with 16GB of RAM, running Linux (so a solid 15GB of RAM is available for the programming language to utilize as needed). Yet, the script would cause the system to slow down until it would eventually freeze (no swap partition was set up when I was installing the OS, since I didn’t expect to ever run out of memory on this machine!). Of course, the problem could be resolved by adding a swap option to the OS, but that still would not be a satisfactory solution, at least not for someone who opts for writing efficient code. After all, when building a system, it is usually built to scale well, and this prototype of mine didn’t look very scalable.

So, I examined the code carefully and came up with various hacks to manage resources better. I also got rid of an unnecessary array that was eating up a lot of memory and rerouted the information flow so that other arrays could be used to provide the same result. After a couple of attempts, the system was running smoothly and without using too much RAM.
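I can’t share the actual code, but the gist of the change resembles the following sketch (the corpus path, the feature function, and the aggregation are placeholders, not the project’s real ones): process the texts one at a time and keep only what is actually needed, instead of materializing every intermediate array.

```python
# A rough sketch of the memory-saving pattern described above; the corpus file,
# feature function, and aggregation are placeholders, not the project's real code.

def extract_features(text):
    """Hypothetical feature extractor returning a small list of numbers."""
    tokens = text.split()
    return [len(tokens), sum(len(t) for t in tokens) / max(len(tokens), 1)]

def stream_corpus(path):
    """Yield one document at a time instead of loading the whole corpus into RAM."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# Memory-hungry version (kept only as a contrast): builds every intermediate
# list before doing anything with it.
#   texts = list(stream_corpus("corpus.txt"))
#   all_features = [extract_features(t) for t in texts]

# Leaner version: no array holding all documents or all features is ever kept.
running_sums, n_docs = None, 0
for doc in stream_corpus("corpus.txt"):          # hypothetical file path
    feats = extract_features(doc)
    running_sums = feats if running_sums is None else [
        s + f for s, f in zip(running_sums, feats)
    ]
    n_docs += 1

if n_docs:
    print("mean features:", [s / n_docs for s in running_sums])
```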
It’s small details like these that make the difference between a data science system that is practical and one that is good only on the conceptual level (or one that requires a large cluster to run properly). Unfortunately, that’s something that is hard to learn through books, videos, or other educational material. Perhaps even conventional experience may not trigger this kind of lesson, though a good mentor might be very beneficial in such cases. The moral of the story for me is that we ought to continuously challenge ourselves in data science and never be content with our aptitude level. Just because something runs without errors identifiable by the language’s compiler doesn’t mean that it’s production-ready. Even in the case of a simple PoC, like this one, we cannot afford to lose focus. Just as data constantly evolves into more and more refined information, we data scientists follow a similar process, growing into more refined manifestations of the craft.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.