Even though this topic may be a bit polarizing, especially among people who are new to data science, knowing more about it can be very useful, particularly if you value a sense of perspective more than a good grade in some data science crash course. The latter is bound to overemphasize either Stats or AI, depending on the instructor's knowledge and experience. However, some data science professionals, myself included, prefer a more balanced approach to the topic. This is why I decided to make this video, which is now available on Safari for your viewing.
Note that this is by no means a complete tutorial on the topic, but it is a good overview of the various aspects of statistics related to data science, along with some programming resources in both Python and Julia, to get you started. Enjoy!
A.I. and ML are often used interchangeably, while many people consider one to be a subset of the other (which one is the bigger set depends on who you ask). However, things may not be as clear-cut as they seem, since the communities of these two fields are not all that related, while there is a sort of rivalry between the hard-core members of each one. Why is that, though, if A.I. and ML are so similar to each other, enough to confuse even data scientists?
First of all, let’s start with some definitions. A.I. is the group of methods, algorithms, and processes that bring about computer systems that emulate human intelligence, even if the intelligence they usually exhibit is quite different from our own. These systems often take the form of self-sufficient machines, such as robots, as well as agent programs that roam the Internet or cyberspace in general. ML, on the other hand, is the group of methods, algorithms, and processes that bring about computer systems that solve some data analytics problem in an efficient manner, through some training procedure (the learning part of machine learning). The training can take place with the help of some specific outcomes (aka targets) or without them. It can also take the form of feedback on the system’s predictions, which is like on-the-job training of sorts.
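To make the two training regimes concrete, here is a minimal sketch in Python; note that scikit-learn and the toy dataset are my own illustrative choices, not something prescribed above.

```python
# Assumption: scikit-learn is available; the dataset is a synthetic stand-in.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A toy dataset: 200 observations, 5 features, and a binary target.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Training with targets (supervised learning): the known outcomes guide the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised training accuracy:", clf.score(X, y))

# Training without targets (unsupervised learning): only the data's
# structure is used, e.g. for grouping similar observations together.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("First 5 cluster assignments:", km.labels_[:5])
```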
Clearly, there is a close link between ML and data science, since ML systems are designed for this sort of problem. A.I. systems, on the other hand, may tackle different kinds of problems too (e.g. finding the optimal route given some restrictions). So, there is a part of A.I. that is leveraged in data science and a part of A.I. that has nothing to do with our craft. The part of A.I. that is used in data science has a large intersection with ML, mainly through network-based systems, such as ANNs. Lately, Deep Learning networks, which are specialized and more sophisticated kinds of ANNs, have become quite popular and are also part of that intersection between A.I. and ML.
Many people who work in A.I. consider it more of a science than ML, and they are right in a way. Most ML methods are heuristics-based and don’t have much theory behind them, while the ones that are tied to Stats (Statistical and ML hybrids) are heavily constrained by the assumptions of Stats theory. A.I. methods are generally data-driven too, but they are also related to processes found in nature, so they don’t come out of the blue.
Nevertheless, a data scientist who is being professional and pragmatic doesn’t put too much emphasis on the differences between A.I. and ML methods, since he cares more about how they can be applied to solve the problems at hand. So, even if these two families of methods are not the same, nor is one a subset of the other, they are both very useful, if not essential, in practical data science.
Recently, a far-reaching scandal broke out as a reporter exposed a data science company called Cambridge Analytica. According to the information gathered, the company used a dataset harvested via Facebook, enriched with a lot of data from the Facebook graph, to affect the 2016 presidential elections in the USA. It is important to note that the role of that project was not exploratory (e.g. finding insights about the voters); rather, it aimed at steering the voters’ views on a certain candidate, in order to benefit the other candidate, who was the company’s client.
Personally, I’m not invested in US politics and don’t have any strong views on the matter, which is why I chose to omit the names of the politicians involved. As a data science professional, however, I find what C.A. did shameful and unethical, on many levels. Examples like this only go to show that, just like everything else in applied science, data science can be used for malicious purposes too, something that every data scientist ought to be aware of and avoid whenever possible.
Also, a topic like this one concerns not just data scientists but anyone working alongside them, since it would be naive to believe that this whole fiasco was the result of a few data science professionals acting on their own. As the corresponding footage shows, the black-hat approach to data analytics was initiated by the company’s head, who was quite forthcoming about what the company was trying to do. That doesn’t make the data scientists working there innocent victims, but at least the responsibility for this dark project is shared among everyone there, not just them. Also, considering that it wasn’t a huge company, it’s quite unlikely that the data scientists weren’t aware of the unethical and immoral agenda their work was serving. Had they refused to cooperate with this plan, they could at the very least have slowed things down.
So, how can we guard ourselves against situations like the C.A. scandal, as data science professionals? First of all, we can avoid working for people who don’t have a moral compass and who are looking at how the data products developed can be used to covertly drive certain behaviors that, if exposed, would be punishable. So, if the leaders of a project are shady individuals who don’t mind hurting others in order to make their clients happy, that’s a red flag.
The data itself could be another potential warning sign. If it is collected through unethical means and used in ways that compromise people’s privacy, that’s a tell-tale sign that something fishy is going on. Another such sign is the insights discovered through such a project (in this case, the categorization of the people involved into four groups relating to some intimate aspects of their personalities). If we are not comfortable sharing these insights with those people (assuming there was no NDA in place prohibiting that), because it just feels wrong, then we shouldn’t be digging up those insights to start with.
Finally, if the data products don’t serve the people involved in the data behind these products, even indirectly, then that’s another red flag. The products we create should be something we can talk about openly (without giving away any sensitive know-how behind them, of course), without feeling ashamed or guilty about their purpose.
Naturally, these few suggestions are but the tip of the iceberg of a very large topic related to the Ethics aspect of our profession. I cannot hope to do this topic justice through a blog article, or even a video like the one I made on this topic last year. However, it’s good to remember that we are not powerless against the malicious use of data science by people who are either immoral or amoral, caring only for themselves at the expense of the well-being of others. We may not always be able to stop their agenda, but we can at least identify an unethical project and not contribute to it. Besides, there are many things we can do with data science, so why not focus on the more beneficial ones instead?
When it comes to DS education, nowadays there is a lot of emphasis given to one of two things: the math aspect of it, and the complex algorithms of deep learning systems. Although all this is essential, particularly if you want to be a future-proof data science professional, there is much more to the field than that. Namely, the engineering mentality is something you need to cultivate, since at its core, data science is an engineering discipline. I don’t mean that in a software sense, but rather as a practicality- and efficiency-oriented approach to building a system.
This is largely due to the scaling dimension of a data science metric or model. Unfortunately, most data science “educators” fail to elaborate on this point, since they focus mainly on parroting other people’s work, instead of inciting students to gain a deeper understanding of the methods and processes being taught. Also, scalability is the filter that distinguishes a robust algorithm from a mediocre one. As we obtain more and more data, having an algorithm that works well only on a small dataset (or one that requires a great deal of parallelization to yield any benefits) is not sustainable. Of course, some people are happy with that, since they have a great deal of resources available, which they are happy to rent out. However, we can often obtain good enough results with fewer resources, through algorithms that scale better. Even if most people don’t share this fox-like approach to data science, that doesn’t make it less relevant. After all, many people associate methods with the frameworks particular companies offer, rather than understanding the science behind these methods.
Scaling a method up intelligently is the product of three things:
1. having a deep understanding of a method
2. not relying on an abundance of resources to scale it up
3. being creative about the method, making compromises where necessary, to make it more lightweight
That’s where the engineering mentality comes in. The engineer understands the math, but isn’t concerned about having the perfect solution to a problem. Instead, he cares about having a good enough solution that is reliable and not too costly.
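As a small illustration of that mentality, consider the sketch below, which contrasts a standard clustering algorithm with its lightweight, approximate variant (scikit-learn and k-means are my own choices here; the point is the trade-off, not the specific method). The approximate version usually finishes much faster at the cost of a slightly worse clustering score, which is precisely the kind of compromise described above.

```python
# Assumption: scikit-learn is installed; the data is a synthetic stand-in.
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# A dataset large enough for the scaling difference to show.
X, _ = make_blobs(n_samples=100_000, centers=10, n_features=10, random_state=0)

for name, Model in [("full k-means", KMeans),
                    ("mini-batch k-means", MiniBatchKMeans)]:
    t0 = time.time()
    model = Model(n_clusters=10, n_init=3, random_state=0).fit(X)
    # Inertia is the within-cluster sum of squares (lower is better).
    print(f"{name}: inertia = {model.inertia_:,.0f}, "
          f"time = {time.time() - t0:.2f}s")
```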
This kind of thinking is what drives the development of modern optimization systems, which are an important part of AI. Artificial Intelligence may involve things like deep learning networks, but there is more to it than that. So, if you want to delve more into this field and its numerous applications in data science, cultivating this engineering mentality is the optimal way to go. Perhaps not the absolute best one, but definitely one that works well and is efficient enough!
For the past few months I've been working on a tutorial on the data modeling part of the data science process. Recently I finished it, and as of two weeks ago it is available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
Nowadays, more than ever before, there are a bunch of experts in the data science field, telling everyone what to think and what’s important. This, although useful to some extent, may be a hindrance after you reach a certain level of expertise. That’s not to say that experts’ views are useless, but it’s always good to take them with a pinch of salt.
Experts are people who have learned the field in such depth that they can think of it as people who speak a foreign language can think in terms of that language’s vocabulary and logical structures (e.g. grammar and syntax). An expert in our field doesn't see data science as something outside himself, but rather as a part of him, much like his ability to read and write. This level of intimacy with the know-how in data science enables him to perceive things that most people cannot, and offer deeper insights about the ins and outs of data science.
However, experts don’t know everything, and it’s very easy for someone to become so enticed by his expertise that the boundaries of his understanding become blurred. This is a very dangerous thing, since the expert may have the false impression that he knows everything there is to know and/or that everything he knows is valid. However, data science is a very dynamic field, so even if you attain expertise in it, things change, and some adaptation is in order. Some experts forget that.
Even if experts have a lot to teach us, we need to always be aware that there are things they do not know, or that they do not know well enough. For example, many experts are very knowledgeable about traditional statistics and whatever lies beyond that part of data science is secondary for them. Yet, even in the field of statistics they only know what they have learned and may lack the curiosity to explore different kinds of Stats, or the humility to acknowledge their existence. Experts like that will tell you that data science is all about statistics, reiterating the stuff they have learned. However, if you try to pinpoint the limitations of what they know, they will label you as a heretic, which is why most people don’t say anything back to them. This is dangerous though, since silence can strengthen their already inflated view of their authority, and bring about even stronger views in them.
That’s why the best approach is to try things out yourself. When an expert makes a claim about a certain topic in data science, instead of taking it as fact, put it to the test to see if it holds water. If it’s something that’s public knowledge, cross-reference it. If it’s something that can be verified or disproved through experimentation, write a script around it, as in the sketch below. Whatever the case, don’t take things for granted just because some expert says so.
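For instance, a script testing the textbook claim that the median is more robust to outliers than the mean might look like the following (the claim and the simulation setup are just a stand-in example of mine):

```python
# Assumption: NumPy is available; the tested claim is an illustrative example.
import numpy as np

rng = np.random.default_rng(42)
means, medians = [], []
for _ in range(1_000):
    sample = rng.normal(loc=0.0, scale=1.0, size=100)
    sample[:5] = 50.0  # contaminate the sample with a few extreme outliers
    means.append(sample.mean())
    medians.append(np.median(sample))

# The mean gets pulled toward the outliers, while the median stays
# near the true center of the distribution (zero).
print("Average sample mean:  ", round(float(np.mean(means)), 3))
print("Average sample median:", round(float(np.mean(medians)), 3))
```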
All this is related to developing the right mindset for data science, which is all about asking questions and trying to answer them in a methodical manner (aka the scientific method), using a variety of data analytics methods and lots of programming. Techniques and tools become obsolete sooner or later, but this mindset I’m referring to is always relevant…
We sometimes find ourselves in situations where, no matter what we do and what model we use, there just isn't anything useful coming out of our analysis. In times like these, we wonder if an A.I. system would magically solve the problem. However, it may be the case that there just isn't any signal in the data we are harvesting.
Of course, this whole thing sounds like a cop-out. It’s easy to say that there is no signal there and throw in the towel. However, giving up too quickly is probably worse than not finding a signal, because doing so may eliminate any chance of ever finding something useful in that data. That’s why deciding that there isn’t any signal worth extracting in the data is a tricky thing to do. We must make this decision only after thoroughly examining the data, trying out a variety of feature combinations as well as meta-features, and also experimenting with various models. If after doing all this we still end up with mediocre results that are hard to distinguish from chance (see the sketch below for one way to check this), then there probably isn’t anything there, and we can proceed to another project.
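One simple way to check the “hard to distinguish from chance” part is a permutation baseline: retrain the model on randomly shuffled targets and compare the scores. The sketch below assumes scikit-learn and uses a synthetic stand-in dataset, both my own illustrative choices.

```python
# Assumption: scikit-learn is installed; X and y here are synthetic stand-ins
# for your actual features and target.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

# Cross-validated accuracy on the real targets...
real_score = cross_val_score(model, X, y, cv=5).mean()

# ...versus on randomly shuffled targets (a crude chance baseline).
rng = np.random.default_rng(1)
chance_score = cross_val_score(model, X, rng.permutation(y), cv=5).mean()

print(f"Real targets:     {real_score:.3f}")
print(f"Shuffled targets: {chance_score:.3f}")
# If the two scores are close, the apparent "signal" may be mere noise.
```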
However, just because there isn’t a strong enough signal in the data at hand, it doesn’t mean the whole idea is worthless. Maybe there is potential in that idea, but we need to pursue it via:
1. more and/or cleaner data like the data we have
2. different kinds of data, to be processed in tandem with the existing data
3. some other application based on that data
The 3rd point is particularly important. Say that we have transaction data, for example, and we want to predict fraud. The data we have is fine, but it is unable to predict anything worthwhile when it comes to fraud. We can still salvage some of the data science work we’ve done though and use it for predicting something else (e.g. some metric for evaluating the efficiency of a transaction, or the general reliability of the network used for these transactions). Just because we cannot predict fraud very well, it doesn’t make the data useless in general.
So, if the data doesn't turn into any viable insights or data products, that’s fine. Not all science experiments end in successful conclusions. We only hear about the success stories in the scientific literature, but for every successful experiment behind these stories there are several other ones that were unsuccessful. As long as we are not daunted by the results and continue working the data, there is always success on the horizon. This success may come about in a somewhat different project though, based on that data. That’s something worth keeping in mind, since it’s really the mindset we have that’s our best asset, even better than our data and our tools.
People talk a lot these days about what it takes to be a good data scientist, and how if you do their boot camp or join their course you will acquire that and make yourself stand out from the data scientist pool. Some of these people may be on to something, but they generally focus a lot on specific skills and general abilities. That’s fine if you have the time to study what they are saying and find for yourself what you need. However, if you just want a single idea that is at the root of all the stuff they talk about, that’s something few can share with you, because they probably don’t know it.
There are data scientists who know, however, what it takes to be a good data scientist, and many of them have already embodied this in their careers. Yet, they are so busy applying it that they don’t go out of their way to let you know, unless of course they are into education, in which case they will probably mention it in their books or videos.
One feature that I’ve found succinctly summarizes what it takes to be a good data scientist, regardless of your domain or your specialization, is consistent engagement in the craft. Let’s break this down a bit, since it’s a fairly complex feature (a meta-feature if you will). It comprises two things working in tandem: consistency and engagement. The first has to do with a sense of rhythm and commitment. All decent data scientists are very focused on what they are doing, even if they are involved in other things (e.g. 90-95% of my work is around data science, though I’m also involved in Cyber Security and, to a smaller extent, in Neuroscience). Also, we tend to practice data science in one way or another very regularly. In other words, it is part of our daily routine. These are all manifestations of consistency.
As for engagement, that is more of an inner state, an aspect of the mindset of a good data scientist. It involves being fascinated by the craft, even if it may seem that it doesn’t hold any secrets from you anymore. The thing is that there are always new things to learn, especially over time as the field evolves and new methods and techniques come about. Engagement is akin to what is known in Zen as the “beginner’s mind,” a certain approach to things as if they are completely new to you. Coupled with the experience and expertise that a good data scientist has, this approach allows him to go deeper into the field and find new ways to bring about value through data science. It also involves coming up with new models, new processes for data engineering, and, in some cases, new data products.
Consistent engagement in data science doesn’t require particular talent or experience, however. Everyone can (and ought to) embrace it. So, instead of trying to memorize the inner workings of some obscure model, just because someone else says so, try cultivating this trait first. Afterwards, everything else will appear easier and more interesting, just like new know-how appears intriguing and within reach, to a novice that has a genuine thirst for learning. After all, there are many ways to achieve mastery of the craft, but they all go through consistent engagement.
“I have never let my schooling interfere with my education.” (a quote commonly attributed to Mark Twain)
People talk about education a lot these days, particularly in a data science setting. However, we need to discern between actual education and training. Both are essential, but it is the former that holds the most value. The latter is easier and oftentimes faster, but it may not be a good investment of your time if it is not accompanied by the former.
Education is all about mindset development and the ability to feel inspired by knowledge, thereby developing a healthy yearning for it. It is what happens when you teach a child how to play a game, or do a specific task. Although it’s more of a state of mind than anything else, education also has a formal aspect to it, related to courses, seminars, workshops, and talks geared towards enhancing one’s understanding of the topic at hand.
Training, on the other hand, is more geared towards techniques, methods, and the technical details of the topic taught. This is useful, of course, since every data scientist needs to know all these things. That’s why there are so many data science books and videos out there! However, knowing how to build an SVM or a neural network doesn’t make someone a competent data scientist. In fact, in some cases it doesn’t even make him an employable one.
Perhaps there is a reason why most companies require X years of experience of their recruits. Some things in data science you can only learn through time, by practicing them and by developing an intuition for the data and how it is processed. Although the idea that a data scientist has to have X years of experience to be worthy remains debatable (why X and not Y?), this trend shows that hiring managers can spot the difference between someone who knows data science from a book (or videos) and someone who knows the craft because she has worked the data and has developed a bunch of models, through lots of trials and the inevitable mistakes that ensue.
Education is therefore something that can be attained through experience, not just by reading and watching data science material on the Safari platform. The latter can be a great start, but you still need to get your hands dirty and also think about the whole thing, instead of just following recipes from a data science cookbook. It’s important to know techniques, no doubt, but unless you have developed an understanding that allows you to go beyond these techniques and explore alternative features and alternative models, you may never grow beyond the advanced-beginner stage.
Even someone who has spent most of his life in data science can still learn about this field, as it's a) very diverse and widespread, and b) always evolving. Personally, I still find that I’m learning new things as I delve deeper into the field and as I converse with other data scientists and A.I. professionals of all levels. This too can be a form of education, no less valuable than the education that comes from creating a new data analytics method, or a new data product. The moment someone starts looking down on education and thinks that he knows “enough” is the moment he begins to become obsolete.
Sometimes it’s easy to get carried away and focus on data science too much, losing sight of its applications. Although this is somewhat common in an academic setting (particularly in universities that don’t have any ties to the industry), it may happen in companies too. When this happens, it’s usually best to walk away, since data science without any real-world application can be problematic.
Data science, and A.I. that’s geared towards data analytics, involve a lot of scientific methodologies, which are quite interesting on their own. This may urge someone to get lost in that aspect of the craft and neglect the application part, particularly where these methodologies are employed to solve real-world problems. That’s not to say that doing data science research is bad. Quite the contrary. However, when the research lacks any application, focusing too much on the math side of things, it is bound to be a waste of resources (unless you are doing it as part of a research project, e.g. for a research center or a university, in which case this is expected). The reason is that data science is by definition an applied field, much like engineering. Particularly when it is undertaken by a company (e.g. a startup), it needs to deliver something concrete and, more importantly, something useful.
It’s hard to overestimate the value of the aspect of data science that has to do with the end-user. After all, this person is often the one paying the bills! Also, focusing on the application part of the craft enables something else too: the more practical implementation of the technologies developed, and the inception of new methods that are more hands-on and therefore useful. This is one of the reasons data science has veered away from Statistics, a field which is by its nature more theoretical and math-oriented than applied. That’s also the main reason why data science involves a lot of programming, oftentimes building things from scratch, even if it’s just simple scripts. That’s quite different from using an all-in-one software package, like SAS or SPSS, where the user merely calls functions and does rudimentary data processing.
You can come up with ingenious methods in data science that could fetch a journal publication or two. However, if these methods don’t add value to an organization, they are not that great from a holistic standpoint. This is observed in other parts of Science too, e.g. Electromagnetism. Despite the various theoretical aspects of that field, its usefulness is also apparent. People who practice this part of Physics tend to be very practical and oftentimes come up with interesting inventions that add value to their users (e.g. electromagnets, or power transformers). Data science is no different.
All the clever mathematics behind a method may be enchanting for the mind, but it’s when this method is put into practice and yields some (oftentimes actionable) insight that it really becomes meaningful. That’s something worth remembering, since it’s easy to lose sight of the questions we are trying to answer and focus too much on the possibilities we discover. Some may argue that it’s the journey that matters, but for a journey to be a journey there needs to be a destination. The latter is usually some person who doesn't care much about the science behind the insights, but more about their applicability and usefulness. Companies like MAXset LLC may be completely ignorant of that, but this doesn't make it a viable strategy. On the other hand, companies that have a chance of providing true value to the world make the business aspect of the craft their priority.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.