For the past few months I've been working on a tutorial on the data modeling part of the data science process. I recently finished it, and as of two weeks ago it is available online on the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
Nowadays, more than ever before, there is no shortage of experts in the data science field telling everyone what to think and what’s important. This, although useful to some extent, may become a hindrance after you reach a certain level of expertise. That’s not to say that experts’ views are useless, but it’s always good to take them with a pinch of salt.
Experts are people who have learned the field in such depth that they can think in its terms, much as people who speak a foreign language think in that language’s vocabulary and logical structures (e.g. grammar and syntax). An expert in our field doesn't see data science as something outside himself, but rather as a part of him, much like his ability to read and write. This level of intimacy with the know-how of data science enables him to perceive things that most people cannot, and to offer deeper insights about the ins and outs of the field.
However, experts don’t know everything, and it’s very easy for someone to become so enticed by his expertise that the boundaries of his understanding become blurred. This is a very dangerous thing, since the expert may have the false impression that he knows everything there is to know and/or that everything he knows is valid. Data science, however, is a very dynamic field, so even if you attain expertise in it, things keep changing and some adaptation is always in order. Some experts forget that.
Even if experts have a lot to teach us, we always need to be aware that there are things they do not know, or do not know well enough. For example, many experts are very knowledgeable about traditional statistics, and whatever lies beyond that part of data science is secondary to them. Yet even in the field of statistics they only know what they have learned, and may lack the curiosity to explore different kinds of Stats, or the humility to acknowledge their existence. Experts like that will tell you that data science is all about statistics, reiterating the stuff they have learned. However, if you try to pinpoint the limitations of what they know, they will label you a heretic, which is why most people don’t talk back to them. This is dangerous though, since silence can strengthen their already inflated sense of authority, and entrench their views even further.
That’s why the best approach is to try things out yourself. An expert makes a claim about a certain topic in data science; instead of taking it as fact, put it to the test to see if it holds water. If it’s something that’s public knowledge, cross-reference it. If it’s something that can be verified or disproved through experimentation, write a script around it. Whatever the case, don’t take things for granted, just because some expert says so.
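For instance, a textbook-style claim like “with a sample of 30, the sample mean of even a skewed distribution is approximately normal” can be checked with a few lines of code rather than taken on faith. The sketch below is illustrative; the exponential distribution, the sample size of 30, and the number of repetitions are all assumptions chosen for the demonstration:

```python
import random
import statistics

random.seed(42)

# Hypothetical expert claim under test: at n = 30, sample means of a
# skewed (here, exponential) distribution behave roughly normally.
# Draw many samples and look at the distribution of their means.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(10_000)
]

mu = statistics.mean(sample_means)
sigma = statistics.stdev(sample_means)

# For an exponential(1) population, theory says the mean of the sample
# means should be 1.0 and their standard deviation 1/sqrt(30) ~ 0.183.
print(f"mean of sample means: {mu:.3f} (theory: 1.000)")
print(f"std of sample means:  {sigma:.3f} (theory: 0.183)")
```

If the simulated values land near the theoretical ones, the claim survives the test; if not, you have concrete grounds to push back, regardless of who made the claim.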
All this is related to developing the right mindset for data science, which is all about asking questions and trying to answer them in a methodical manner (aka the scientific method), using a variety of data analytics methods and lots of programming. Techniques and tools become obsolete sooner or later, but this mindset I’m referring to is always relevant…
We sometimes find ourselves in situations where no matter what we do, and what model we use, there just isn't anything useful coming out of our analysis. In times like these we wonder if an A.I. system would magically solve the problem. However, it may be the case that there just isn't any signal in the data that we are harvesting.
Of course, this whole thing sounds like a cop-out. It’s easy to say that there is no signal there and throw in the towel. However, giving up too quickly is probably worse than not finding a signal, because doing so may eliminate the possibility of ever finding something useful in that data. That’s why deciding that there isn’t any signal worth extracting in the data is a tricky thing to do. We must make this decision only after thoroughly examining the data, trying out a variety of feature combinations as well as meta-features, and also experimenting with various models. If after doing all this we still end up with mediocre results that are hard to distinguish from chance, then there probably isn’t anything there, and we can proceed to another project.
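One way to make the “hard to distinguish from chance” judgment less subjective is a permutation check: score the same simple model on the real labels and on shuffled labels, and see whether the real score stands out. The nearest-centroid model and the noise data below are illustrative assumptions, not a prescription:

```python
import random
import statistics

random.seed(0)

# Toy data set with, by construction, no real signal: two noise
# features and random binary labels.
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]

def centroid_accuracy(X, y):
    """Fit a nearest-centroid classifier and return its training accuracy."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    cents = {c: [statistics.mean(col) for col in zip(*rows)]
             for c, rows in groups.items()}
    def predict(row):
        return min(cents, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(row, cents[c])))
    return sum(predict(r) == lbl for r, lbl in zip(X, y)) / len(y)

# Score on the real labels, then on shuffled labels; if the real score
# doesn't stand out from the shuffled ones, the model is doing no
# better than chance (which, for this toy data, is exactly the case).
real = centroid_accuracy(X, y)
shuffled_scores = []
for _ in range(100):
    y_perm = y[:]
    random.shuffle(y_perm)
    shuffled_scores.append(centroid_accuracy(X, y_perm))

p_value = sum(s >= real for s in shuffled_scores) / len(shuffled_scores)
print(f"accuracy on real labels: {real:.3f}, permutation p-value: {p_value:.2f}")
```

A large p-value here supports the “no signal worth extracting” conclusion; a small one suggests the data deserves more work before we walk away.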
However, just because there isn’t a strong enough signal in the data at hand, it doesn’t mean the whole idea is worthless. Maybe there is potential in it, but we need to pursue it via:
1. more and/or cleaner data of the kind we already have
2. different kinds of data, to be processed in tandem with the existing data
3. some other application based on that data
The 3rd point is particularly important. Say that we have transaction data, for example, and we want to predict fraud. The data we have is fine, but it is unable to predict anything worthwhile when it comes to fraud. We can still salvage some of the data science work we’ve done though and use it for predicting something else (e.g. some metric for evaluating the efficiency of a transaction, or the general reliability of the network used for these transactions). Just because we cannot predict fraud very well, it doesn’t make the data useless in general.
So, if the data doesn't turn into any viable insights or data products, that’s fine. Not all science experiments end in successful conclusions. We only hear about the success stories in the scientific literature, but for every successful experiment behind these stories there are several other ones that were unsuccessful. As long as we are not daunted by the results and continue working the data, there is always success on the horizon. This success may come about in a somewhat different project though, based on that data. That’s something worth keeping in mind, since it’s really the mindset we have that’s our best asset, even better than our data and our tools.
People talk a lot these days about what it takes to be a good data scientist, and how if you do their boot camp or join their course you will acquire that and make yourself stand out from the data scientist pool. Some of these people may be on to something, but they generally focus a lot on specific skills and general abilities. That’s fine if you have the time to study what they are saying and find out for yourself what you need. However, if you just want the single idea at the root of all the stuff they talk about, that’s something few can share with you, because they probably don’t know it.
There are data scientists who do know, however, what it takes to be a good data scientist, and many of them have already embodied it in their careers. Yet they are so busy applying it that they don’t go out of their way to let you know, unless of course they are into education, in which case they will probably mention it in their books or videos.
One trait that I’ve found succinctly summarizes what it takes to be a good data scientist, regardless of your domain or your specialization, is consistent engagement in the craft. Let’s break this down a bit, since it’s a fairly complex trait (a meta-trait, if you will). It comprises two things working in tandem: consistency and engagement. The first has to do with a sense of rhythm and commitment. All decent data scientists are very focused on what they are doing, even if they are involved in other things (e.g. 90-95% of my work is around data science, though I’m also involved in Cyber Security and, to a smaller extent, in Neuroscience). Also, we tend to practice data science in one way or another very regularly; in other words, it is part of our daily routine. These are all manifestations of consistency.
As for engagement, that is more of an inner state, an aspect of the mindset of a good data scientist. It involves being fascinated by the craft, even if it may seem that it no longer holds any secrets for you. The thing is that there are always new things to learn, especially as the field evolves and new methods and techniques come about. Engagement is akin to what is known in Zen as the “beginner’s mind”: an approach to things as if they are completely new to you. Coupled with the experience and expertise that a good data scientist has, this approach allows him to go deeper into the field and find new ways to bring about value through data science. It also involves coming up with new models, new processes for data engineering, and in some cases, new data products.
Consistent engagement in data science doesn’t require particular talent or experience, however. Everyone can (and ought to) embrace it. So, instead of trying to memorize the inner workings of some obscure model, just because someone else says so, try cultivating this trait first. Afterwards, everything else will appear easier and more interesting, just like new know-how appears intriguing and within reach, to a novice that has a genuine thirst for learning. After all, there are many ways to achieve mastery of the craft, but they all go through consistent engagement.
“I have never let my schooling interfere with my education.” (a quote commonly attributed to Mark Twain)
People talk about education a lot these days, particularly in a data science setting. However, we need to discern between actual education and training. Both are essential, but it is the former that holds the most value. The latter is easier and oftentimes faster, but it may not be a good investment of your time if it is not accompanied by the former.
Education is all about mindset development and the ability to be inspired by knowledge, thereby developing a healthy yearning for it. It is what happens when you teach a child how to play a game, or do a specific task. Although it’s more of a state of mind than anything else, education also has a formal aspect to it, related to courses, seminars, workshops and talks geared towards enhancing one’s understanding and comprehension of the topic at hand.
Training, on the other hand, is more geared towards techniques, methods, and the technical details of the topic taught. This is useful, of course, since every data scientist needs to know all these things. That’s why there are so many data science books and videos out there! However, knowing how to build an SVM or a neural network doesn’t make someone a competent data scientist. In fact, in some cases it doesn’t even make him an employable one.
Perhaps there is a reason why most companies require X years of experience of their recruits. Some things in data science you can only learn over time, by practicing them and by developing an intuition for the data and how it is processed. Although the idea that a data scientist has to have X years of experience to be worthy remains debatable (why X and not Y?), this trend shows that hiring managers can spot the difference between someone who knows data science from a book (or videos) and someone who knows the craft because she has worked the data and developed a bunch of models, through lots of trials and the inevitable mistakes that ensue.
Education is therefore something that can be attained through experience, not just by reading and watching data science material on the Safari platform. The latter can be a great start, but you still need to get your hands dirty and think about the whole thing, instead of just following recipes from a data science cookbook. It’s important to know techniques, no doubt, but unless you have developed an understanding that allows you to go beyond these techniques and explore alternative features and alternative models, you may never grow beyond the advanced-beginner stage.
Even someone who has spent most of his life in data science can still learn about the field, as it's a) very diverse and widespread, and b) always evolving. Personally, I still find that I’m learning new things as I delve deeper into the field and as I converse with other data scientists and A.I. professionals of all levels. This too can be a form of education, no less valuable than the education that comes from creating a new data analytics method, or a new data product. The moment someone starts looking down on education and thinks that he knows “enough” is the moment he begins to become obsolete.
Sometimes it’s easy to get carried away and focus on data science too much, losing sight of the applications of it. Although this is something somewhat common in an academic setting (particularly in universities that don’t have any ties to the industry), it may happen in companies too. When this happens, it’s usually best to walk away, since data science without any real-world application can be problematic.
Data science, and A.I. that’s geared towards data analytics, involve a lot of scientific methodologies, which are quite interesting on their own. This may lead someone to get lost in that aspect of the craft and neglect the application part, particularly the one where these methodologies are employed for solving real-world problems. That’s not to say that doing data science research is bad. Quite the contrary. However, when the research lacks any application, focusing too much on the math side of things, it is bound to be a waste of resources (unless you are doing this as part of a research project, e.g. for a research center or a university, in which case this is expected). The reason is that data science is by definition an applied field, much like engineering. Particularly when it is undertaken by a company (e.g. a startup), it needs to be able to deliver something concrete, and more importantly, something useful.
It’s hard to overestimate the value of the aspect of data science that has to do with the end-user. After all, this person is often the one paying the bills! Also, focusing on the application part of the craft enables something else too: the more practical implementation of the technologies developed, and the inception of new methods that are more hands-on and therefore useful. This is one of the reasons that data science has veered away from Statistics, a field which is by its nature more theoretical and more math-y than applied science. That’s also the main reason why data science involves a lot of programming, oftentimes building things from scratch, even if only simple scripts. That’s quite different from using an all-in-one software package, like SAS or SPSS, where the user merely calls functions and does rudimentary data processing.
You can come up with ingenious methods in data science that could fetch a journal publication or two. However, if these methods don’t add value to an organization, they are not that great from a holistic standpoint. This is observed in other parts of Science too, e.g. Electromagnetism. Despite the various theoretical aspects of that field, its usefulness is also apparent. People who practice this part of Physics tend to be very practical and oftentimes come up with interesting inventions that add value to their users (e.g. electromagnets, or power transformers). Data science is no different.
All the clever mathematics behind a method may be enchanting for the mind, but it’s when this method is put into practice and yields actionable insights that it really becomes meaningful. That’s something worth remembering, since it’s easy to lose sight of the questions we are trying to answer and focus too much on the possibilities we discover. Some may argue that it’s the journey that matters, but for a journey to be a journey there needs to be a destination. The latter is usually some person who doesn't care much about the science behind the insights, but rather about their applicability and usefulness. Companies like MAXset LLC may be completely oblivious to that, but this doesn't make theirs a viable strategy. On the other hand, companies that have a chance of providing true value to the world make the business aspect of the craft their priority.
It is easy to fall into the misconception that in data science we are all solitary people, doing our work and interacting only in the workplace and on social media. Perhaps we are part of some data science team, but still feel we are on our own when it comes to our relationship with the field. However, this is just one of many possibilities for how we relate to the data science world, and it is definitely not the best one.
Being part of a community in data science is not only possible but also necessary. Of course just networking with other data scientists may not be enough, but it is often a good starting point. This is particularly important towards the beginning of one’s career. After all, not even the best data science books can give someone solace in times of difficulty or doubt. That’s when having a good mentor comes in very handy. After all, even if that mentor is a bit aloof and preoccupied with his own stuff, he tends to have a genuine interest in your career and is motivated to help you out, at least to some extent. This can be another step towards becoming part of a community of data science professionals.
Make no mistake, however. Neither the mentor, nor anyone else, is going to fight your battles for you. The other data scientists, be they professional acquaintances, mentors, or teammates, have their own battles to tackle. However, they may be able to offer you advice or help you gain insight into solutions that you couldn't come up with by yourself, especially while you are immersed in the problems you are tackling.
Finding a physical community may not always be possible. Not all cities are as advanced as the ones where the field thrives, bustling with data science events and activities. However, there are data scientists out there who are also in need of a community, so it’s only a matter of time before you find them. Perhaps you’ll “meet” them online, through some social network or a data science forum. Maybe you’ll encounter them at a data science conference, or in a webinar. Bottom line: if you are open to finding a community of data scientists, the opportunities to do so will manifest sooner or later.
Being part of a data science community is not just about getting help in difficult times, though. It’s also a great accelerator for developing yourself as a data scientist, through exposure to new trends, novel approaches to known problems, and most importantly, problems that you’d probably not encounter on your own, even if you work in a data-driven company. All that is bound to foster in you the knowledge and know-how you need to advance to the next level, whatever that level is for you. At the same time, it can help you maintain your enthusiasm for data science, and perhaps even make you more zestful about the field. After all, it is usually the people who are passionate about something that make the most progress in it and are consistent in doing so. Data science is no different in that respect.
Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are assumption-free. This by itself enables them to tackle the problems they aim to solve in a very methodical and efficient way. Of course, certain assumptions might speed things up, but they might also obstruct the discovery of optimal solutions to the problem at hand. Statistical inference models lost the war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models can be combined in an ensemble setting allows them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other statistical methods being superseded by alternative systems is quite real.
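The robustness gain from ensembles is easy to demonstrate: three classifiers that are each right about 70% of the time, with independent errors, are right about 78% of the time when combined by majority vote (3 · 0.7² · 0.3 + 0.7³ ≈ 0.784). The simulated predictions below are an idealized illustration, assuming independent errors:

```python
import random

random.seed(1)

# Ground truth plus three simulated classifiers that are each right
# ~70% of the time, with independent errors (an idealized assumption).
n = 10_000
truth = [random.randint(0, 1) for _ in range(n)]

def noisy_predictions(truth, accuracy):
    """Flip each true label with probability 1 - accuracy."""
    return [t if random.random() < accuracy else 1 - t for t in truth]

preds = [noisy_predictions(truth, 0.70) for _ in range(3)]

# Majority vote: at least 2 of the 3 classifiers must agree.
vote = [int(sum(p[i] for p in preds) >= 2) for i in range(n)]

acc_single = sum(p == t for p, t in zip(preds[0], truth)) / n
acc_vote = sum(v == t for v, t in zip(vote, truth)) / n
print(f"single model: {acc_single:.3f}, majority vote: {acc_vote:.3f}")
```

In practice, real models' errors are correlated, so the gain is smaller than this idealized figure, but the direction of the effect is the same.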
On the other hand, statistics are very easy to use and interpret, since most statistical methods were designed from a user’s perspective. There are doctors out there (the medical kind) who don’t know much about data analytics but can easily work a statistical model for figuring out if a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. Such a doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field, using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
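The kind of analysis such a practitioner relies on really does boil down to very little machinery. As a minimal sketch of a two-group comparison (Welch's t statistic, with made-up trial data and a normal approximation for the 5% threshold; the group means, sizes, and spread are all assumptions for illustration):

```python
import random
import statistics

random.seed(7)

# Made-up trial data: outcome scores for a treatment group (true mean
# 52) and a control group (true mean 50), both with std deviation 5.
treatment = [random.gauss(52, 5) for _ in range(100)]
control = [random.gauss(50, 5) for _ in range(100)]

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / std_err

t = welch_t(treatment, control)
# With ~200 observations, |t| > 1.96 indicates significance at the 5%
# level (using the normal approximation to the t distribution).
print(f"t = {t:.2f}, significant at 5%: {abs(t) > 1.96}")
```

One number, one threshold, one conclusion: that's the accessibility argument for statistics in a nutshell, and it's hard for a black-box model to compete with it on that front.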
Of course, A.I. constantly evolves so the black-box issue that makes many ANN-based systems unfavorable, may wane in the future. Already there are A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn't said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open, when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be more of a novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn't lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historic novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing if sample A is significantly different from sample B.
Recently someone on LI recommended that I bring more JOY to the world instead of merely complaining about it (I wasn’t complaining, but apparently she thought I was!). I’m not an entertainer, nor a psychology expert, but perhaps you don’t need to be in these lines of work in order to bring joy to the people you interact with. I thought about it and decided that perhaps data science could be a source of joy to other people. However, for this to happen, it needs first and foremost to be joyful to you.
Deriving joy from a challenging and oftentimes frustrating process such as a data science project is not easy. In fact, many people can’t stand the largest part of the work such a project entails. However, with the right mindset, even the more tedious aspects of the work can be enjoyable (i.e. be conducive to joy). So, what is this mindset that turns boredom to beauty and drudgery to delight?
Although there is no magic formula for making things more enjoyable in data science, if you approach a problem with the attitude of the amateur, your chances of enjoying it are better. This doesn’t mean being sloppy and checking Stack Overflow or Quora every 5 minutes. The amateur’s attitude is, as the word implies, an attitude based on love for what you are doing. The amateur doesn’t care if they get paid for their work. They may never get paid at all, but they do it anyway because they find it fulfilling; it’s like a hobby for them.
However, a data scientist still needs to be professional about her work. There are deadlines, meetings with stakeholders, and of course debugging scripts that throw errors at the worst possible time! Handling these matters takes professionalism, but it doesn’t need to be a mechanical and draining process. If you see part of your work as a data scientist (even the debugging stage) as a learning experience and have what is known in Zen as the beginner’s mind, you are bound to find everything a bit more enjoyable. It’s the joy that comes from detachment and lack of rigid expectations from your work, something that every professional knows.
Remembering all this, especially on a Monday morning, is not as straightforward as it may seem. However, being joyful is a matter of perspective and, at the end of the day, a matter of habit. Aristotle famously said that “virtue is a matter of habit”, and some could argue that joy is a kind of virtue. Maybe not something you would put on your resume or talk about in an interview, but definitely something worth keeping in mind on those long mornings when you may be tempted to question your career choices. After all, if you could be joyful about data science as a field once, you can be joyful about data science work too. And if you still feel that you need some help getting your enthusiasm flowing and invigorating a joyful mindset, you can always read my book Data Science – Mindset, Methodologies, and Misconceptions. :-)
After several days being in limbo, the video "Remaining Relevant in Data Science" that I've made recently, is now online on Safari (link). If you have a subscription to that platform, do check it out. If you prefer to access this kind of knowledge through a different medium, feel free to check out the last chapter of my latest book, Data Science Mindset, Methodologies, and Misconceptions. Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.