In many data science courses, these peculiar data points in a dataset often go by the term “anomalies” and are considered to be inherently bad. In fact, many suggest that they be removed before the data modeling stage. Now, I can hardly contradict that approach, partly because I myself have taken that stance when covering basic data engineering topics, but also because there is merit in this treatment of outliers and inliers. After all, they are just too weird to be left as they are, right?
Well, it depends. In most cases where they are removed, it’s because we are going to use some run-of-the-mill model that is too rudimentary to do anything clever with the data it’s given. So, if there are anomalous data points in the training set, it’s likely to overfit or at the very least underperform. This would not happen so often in an A.I. model, which is one of the reasons why the data engineering stage is so closely linked to the data modeling one. Also, sometimes the signal we are looking for lies in those very anomalous elements, so getting rid of them isn’t wise then.
Regardless of all that though, we need to differentiate between these two kinds of anomalies. Outliers can be easily smoothed out if we adopt a possibilistic way of handling the data, instead of the crude statistical metrics we are used to using. Smoothing outliers is also a good way to retain more signal in the dataset (especially if we are working with a small sample), something that translates into better-performing models.
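To make this more tangible, here’s a minimal sketch of what such smoothing could look like in Python, assuming a robust center (the median) and a MAD-based scale; the membership-style shrinkage and the 3.5 cutoff are illustrative choices on my part, not a standard recipe:

```python
import numpy as np

def smooth_outliers(x, threshold=3.5):
    """Shrink extreme values toward the median based on a membership score.

    Each point gets a degree of membership to the "typical" region of the
    variable; points with low membership are pulled inward rather than
    removed, so the sample keeps its size and most of its signal.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = max(np.median(np.abs(x - med)), 1e-12)  # guard against zero spread
    z = np.abs(x - med) / (1.4826 * mad)          # robust distance from center
    membership = np.clip(threshold / np.maximum(z, 1e-12), 0.0, 1.0)
    # Blend each point with the median, according to its membership degree
    return membership * x + (1.0 - membership) * med

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 25.0])  # 25.0 is an outlier
print(smooth_outliers(data))  # the outlier gets pulled toward the center
```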
Inliers though are harder to process. Oftentimes removing them is the best strategy, but they need to be looked at holistically, not just in individual variables. Also, even if they distort the signal at times, they may not be that harmful when doing dimensionality reduction, so keeping them in the dataset may be a good idea. Nevertheless, it’s good to make a note of these anomalous elements, as they may have a particular significance once the data is processed by a model we build. Perhaps we can use them as fringe cases in a classification model, for example, to do some more rigorous testing of it.
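As a rough illustration of that holistic angle, the sketch below flags a point that is unremarkable in each variable on its own yet clearly breaks the joint pattern of the data. The Mahalanobis distance used here is just one familiar way to score this, not the only one:

```python
import numpy as np

def mahalanobis_scores(X):
    """Distance of each row from the multivariate center of the data."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diffs = X - mu
    # Quadratic form diffs @ cov_inv @ diffs', computed row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))

rng = np.random.default_rng(0)
f1 = rng.normal(0, 1, 500)
f2 = f1 + rng.normal(0, 0.1, 500)  # F2 closely follows F1
X = np.vstack([np.column_stack([f1, f2]),
               [1.0, -1.0]])  # an inlier: normal in F1 and F2, abnormal jointly
scores = mahalanobis_scores(X)
print(scores[-1], scores[:-1].mean())  # the inlier's score stands out
```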
To sum up, outliers and inliers are interesting data points in a dataset, and whether they are more noise than signal depends on the problem we are trying to solve. When tackled in a multi-dimensional manner, they can be better identified, but when processed, certain care needs to be taken. After all, just because certain data analytics methods aren’t well-equipped to handle them, we shouldn’t change our data to suit the corresponding models / metrics. Often we have more to gain by shifting our perspective and adapting our ways to the data at hand. The possibilistic approach to data may be a great asset in all that. Should you wish to learn more about outliers and inliers, you can check out my presentation video on this topic on the Safari platform.
Contrary to the probabilistic approach to data analytics, which relies on probabilities and ways to model them, usually through a statistical framework, the possibilistic approach focuses on what’s actually there, not what could be there, in an effort to model uncertainty. Although not officially a paradigm (yet), it has what it takes to form a certain mindset, highly congruent with that of a competent data scientist.
If you haven’t heard of the possibilistic approach to things, that’s normal. Most people have already jumped on the bandwagon of the probabilistic dogma, so someone seriously thinking about things possibilistically would be considered eccentric at best. After all, the last successful possibilistic systems are often considered obsolete, due to their inherent limitations when it comes to higher-dimensionality datasets. I’m referring to Fuzzy Logic systems, which are part of the GOFAI family of A.I. systems (in these systems the possibilities are expressed as membership levels, through corresponding functions). These systems are still useful, of course, but not the go-to choice when it comes to building an A.I. solution to most modern data science problems.
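For those who haven’t come across membership functions before, here’s a minimal example of a triangular one, with illustrative breakpoints for a fuzzy “warm” temperature set:

```python
def triangular_membership(x, a, b, c):
    """Degree (0 to 1) to which x belongs to a fuzzy set peaking at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)  # rising edge
    return (c - x) / (c - b)      # falling edge

# With breakpoints 15, 25, 35, a temperature of 22 is "warm" to degree 0.7
print(triangular_membership(22, 15, 25, 35))
```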
Possibilistic reasoning is that which relies on concrete facts and observable relationships in the data at hand. It doesn’t assume anything, nor does it opt for shortcuts by summarizing a variable with a handful of parameters corresponding to a distribution. So, if something is predicted with a possibilistic model, you know all the hows and whys of that prediction. This stands in direct contrast to the black-box predictions of most modern A.I. systems.
Working with possibilities isn’t easy though. Oftentimes it requires a lot of computational resources, and when the data is complex, an abundance of creativity as well. For example, you may need to do some clever dimensionality reduction before you can start looking at the data, while unbiased sampling may be a prerequisite too, particularly in transduction-related systems. So, if you are looking for a quick-and-easy way of doing things, you may want to stick with MXNet, TensorFlow, or whatever A.I. framework takes your fancy.
If on the other hand you are up for a challenge, then you need to start thinking in terms of possibilities, forgetting about probabilities for the time being. Some questions that may help in that are the following:
* How much does each data point contribute to a metric (e.g. one of central tendency or one of spread)?
* Which factors / features influence the similarity between two data points and by how much?
* What do the fundamental components of a dataset look like, if they are defined by both linear and non-linear relationships among the original features?
* How can we generate new data without any knowledge of the shape or form of the original dataset?
* How can we engineer the best possible centroids in a K-means-like clustering framework?
* What is an outlier or inlier essentially and how does it relate to the rest of the dataset?
For all of these cases, assume that there is no knowledge of the statistical distributions of the corresponding variables. In fact, you are better off disregarding any knowledge of Stats whatsoever, as it’s easy to be tempted to use a probability-based approach.
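As a hedged sketch of the first question on that list, the snippet below measures each point’s contribution to the mean and the spread by simply leaving the point out and observing how much the metric shifts, with no distributional assumptions whatsoever:

```python
import numpy as np

def point_contributions(x):
    """Leave-one-out shift each point causes in the mean and the spread."""
    x = np.asarray(x, dtype=float)
    full_mean, full_std = x.mean(), x.std()
    mean_shift = np.array([full_mean - np.delete(x, i).mean()
                           for i in range(len(x))])
    std_shift = np.array([full_std - np.delete(x, i).std()
                          for i in range(len(x))])
    return mean_shift, std_shift

mean_shift, std_shift = point_contributions([2.0, 2.1, 1.9, 2.0, 8.0])
print(mean_shift)  # the last point dominates the central tendency metric...
print(std_shift)   # ...and the spread metric even more so
```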
Finally, although this new way of thinking about data is in many respects superior to the probabilistic one, the latter has its uses too. So, I’m not advocating that you shouldn’t learn Stats. In fact, I’d argue that only after you’ve learned Stats quite well will you be able to appreciate the possibilistic approach to data in full. So, if you are looking into A.I., Machine Learning, or both, you may want to consider a possibilistic way of tackling uncertainty, instead of blindly following those who have vested interests in the currently dominant paradigm.
Although I’ve talked about dimensionality reduction for data science in the corresponding video on Safari, covering various angles of the topic, I was never fully content with the methodologies out there. After all, all the good ones are fairly sophisticated, while all the easier ones are quite limited. Could there be a different (better) way of performing dimensionality reduction in a dataset? If so, what issue would such a method tackle?
First of all, conventional dimensionality reduction methods tend to come from Statistics. That’s great if the dataset is fairly simple, but methods like PCA focus on the linear relationships among the features, which, although a good place to start, doesn’t cover all the bases. For example, what if features F1 and F2 have a non-linear relationship? Will PCA be able to spot that? Probably not, unless there is a strong linear component to it. Also, if F1 and F2 follow some strange distribution, the PCA method won’t work very well either.
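A quick sketch illustrates the point. The data here is synthetic, with F2 being an almost noise-free function of F1, yet PCA still needs both components to account for the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
f1 = rng.uniform(-1, 1, 1000)
f2 = f1 ** 2 + rng.normal(0, 0.02, 1000)  # purely non-linear dependence on F1

pca = PCA(n_components=2).fit(np.column_stack([f1, f2]))
# Both components carry substantial variance: PCA cannot fold F2 into F1,
# even though F2 is almost fully determined by F1.
print(pca.explained_variance_ratio_)
```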
What’s more, what if you want meta-features that are independent of each other, yet still explain a lot of variance? Clearly PCA won’t always give you this sort of result, since for complex datasets the PCs end up entangled with one another. Also, ICA, a method designed for independent components, is not as easy to use, since it’s hard to figure out exactly where to stop when it comes to selecting meta-features.
In addition, what’s the deal with outliers in the features? Surely they affect the end result, changing the whole landscape of the features and at times breaking the scale equilibrium altogether. Well, that’s one of the weak points of PCA and similar dimensionality reduction methods, since they require some data engineering before they can do their magic.
Finally, how much does each one of the original features contribute to the meta-features you end up with after using PCA? That’s a question that few people can answer although the answer is right there in front of them. Also, such a piece of information may be useful in evaluating the original features or providing some explanation of how much they are worth in terms of predictive potential, after the meta-features are used in a model.
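The answer sits in the loadings of the components. Here’s a minimal example using scikit-learn’s PCA on the classic iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# Each row of components_ holds the weights (loadings) of the original
# features in one meta-feature; their absolute values give a rough gauge
# of how much each feature contributes to it.
for i, pc in enumerate(pca.components_, start=1):
    print(f"PC{i} loadings:", np.round(pc, 3))
```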
All of these issues and more can be tackled by using a new approach to dimensionality reduction, one that is based on a new paradigm (the same one that can tackle the clustering issues mentioned in the previous post). Also, even though the new approach doesn’t use a network architecture, it can still be considered a type of A.I., as there is some kind of optimization involved. As for the specifics of the new approach, that’s something to be discussed in another post, when the time is right...
I was never particularly fond of this unsupervised learning methodology that falls under the umbrella of machine learning. It’s not that I didn’t see value in it, but the methods available when I started delving into it were rudimentary at best and fairly crude. In fact, if I were to do a PhD now, I’d choose a clustering-related topic, since there is so much room for improvement that even a simple idea is bound to improve the most popular clustering methods out there!
However, the fact that data science researchers and machine learning engineers in particular haven’t spent much time looking into clustering doesn’t make clustering a bad methodology. In fact, I’d argue that it’s one of the most insightful ones and it plays an important role in many data science projects, particularly in the data exploration stage.
The key issues with clustering are:
1. The whole set of distance metrics used
2. The fact that the vast majority of clustering methods yield a (slightly) different result every time they are run
3. The need for an external parameter (K) in most clustering methods used in practice, in order to define how many clusters there are
4. The shallowness of its results, which amount to a single, flat partitioning of the data
There may be more issues with clustering, but these are the most important ones I’ve found. So, if we were to rethink clustering and do it better, we’d need to address each one of these issues. Namely:
1. A new set of distance metrics would be needed, metrics that are less influenced by the dimensional “noise” that arises when the dataset has many dimensions.
2. The option for a deterministic clustering method would need to exist, one that would optimize the centroid seeds before starting the whole clustering process (see the sketch after this list).
3. An optimization process would need to be in place so as to find the best number of clusters. This should include the possibility of a single cluster, in case there isn’t enough diversity in the dataset.
4. A multi-level clustering option needs to be available, much like hierarchical clustering but in reverse, i.e. start with the main clusters in the dataset and gradually dig deeper into levels of sub-clusters.
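Regarding the second point, here is a minimal sketch of one deterministic seeding heuristic (farthest-point initialization). It is merely an illustrative choice on my part, not the definitive way to optimize the seeds:

```python
import numpy as np

def farthest_point_seeds(X, k):
    """Pick k seed centroids deterministically, via the farthest-point heuristic.

    Start from the point closest to the overall center (a deterministic
    choice), then repeatedly pick the point farthest from all seeds chosen
    so far. Feeding these seeds to k-means yields the same clustering on
    every run.
    """
    seeds = [int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - X[s], axis=1) for s in seeds],
                       axis=0)  # distance of each point to its nearest seed
        seeds.append(int(np.argmax(dists)))
    return X[seeds]

# Usage with scikit-learn (n_init=1, since the seeds are fixed):
# from sklearn.cluster import KMeans
# km = KMeans(n_clusters=3, init=farthest_point_seeds(X, 3), n_init=1).fit(X)
```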
Now, all this may sound simple, but it’s not so easy to put into practice. Apart from an in-depth understanding of data science, quite a refined programming ability is needed too, so that the implementation of this clustering approach can be efficient and scalable. Perhaps all this is not even possible within the conventional data analytics framework, but there is not a single doubt in my mind that it is possible in general, and if a high-performance language is used (e.g. Julia), it is even practically feasible.
Naturally, a clustering framework like this one would require a certain level of A.I. to be used. This doesn’t have to be an ANN though, since A.I. can take many forms, not just network-based ones. Whatever the case, conventional statistics-based methods may be largely inadequate, while the very basic machine learning methods for clustering may not be sufficient either.
This illustrates something that many data science practitioners have forgotten: that data science methods evolve, just like other aspects of the craft. New tools may be intriguing, but equally intriguing are the conventional methodological tools, especially if we were to rethink them from a more advanced perspective. This can be beneficial in many ways, such as opening new avenues of data analytics and even synthesizing new data. This, however, is a story for another time...
It’s not the programming language, as some people may think. After all, if you know what you are doing, even a suboptimal language could be used without too much of an efficiency compromise. No, the biggest mistake people make, in my experience, is that they rely too much on the libraries and off-the-shelf methods they come across. This is not the worst part though. If someone relies excessively on predefined processes and methods, the chances of that person’s role getting automated by an A.I. are quite high. So, what can you do?
For starters, one needs to understand that both data science and artificial intelligence, like other modern fields, are in a state of flux. This means that what was considered gospel a few years back may be irrelevant in the near future, even if it is somewhat useful right now. Take Expert Systems, for example. These were all the rage during the time when A.I. came out as an independent field. However, nowadays, they are hardly used and in the near future, they may appear more anachronistic than ever before. That’s not to say that modern aspects of data science and A.I. are going to wane necessarily, but if one focuses too much on them, at the expense of the objective they are designed for, that person risks becoming obsolete as they become less relevant.
Of course, certain things may remain relevant no matter what. Regardless of how data science and A.I. evolve, the k-fold cross-validation method will still be useful. The same goes for certain evaluation metrics. So, how do you discern what is bound to remain relevant from what isn’t? Well, you can’t, unless you try to innovate. If certain methods appear too simple, for example, they may not stick around for much longer, even if they linger in the textbooks. Do these methods already have variants that outperform the original algorithms? Are people developing similar methods to overcome the drawbacks they exhibit? What would you do if you were to improve these methods? Questions like these may be hard to answer, because you won’t find the necessary info on Wikipedia or on StackOverflow, but they are worth thinking about for sure, even if an exact answer eludes you.
For example, I always thought that clustering had to be stochastic, because everyone was telling me that it is an NP-hard problem that cannot be solved efficiently with a deterministic method. Well, with this mindset no innovations would ever take place in that method of unsupervised learning, would they? So, I questioned this matter and found out that not only are there ways to solve clustering in a deterministic way, but some of these methods are more stable than the stochastic ones. Are they easy? No. But they work. So, just like we tend to opt for mechanized transportation today, instead of the (much simpler) horse-and-carriage alternative, perhaps the more sophisticated clustering methods will prevail. But even if they don’t (after all, there are no limits to some people’s distaste for something new, especially if it’s difficult for them to understand), the fact that I’ve learned about them enables me to be more flexible if this change takes place. At the same time, I can be more prepared for other changes in the field of a similar nature.
I am not against stochastic methods, by the way, but if an efficient deterministic solution exists for a problem, I see no reason why we should stick with a stochastic approach to it. However, for optimization-related scenarios, especially those involving very complex problems, the stochastic approach may be the only viable option. Bottom line, we need to be flexible about these matters.
To sum up, learning the conventional ways of solving data-related problems, be it through data science methods or via A.I. ones, is but the first step. Stopping there though would be a grave mistake, since you’d be depriving yourself of the opportunity to delve deeper into the field and explore not only what’s feasible but also what’s possible. Isn’t that what science is about?
Even though this topic may be a bit polarizing, especially among people who are new to data science, knowing more about it can be very useful, particularly if you value a sense of perspective more than a good grade in some data science crash course. The latter is bound to overemphasize either Stats or A.I., depending on the instructor's knowledge and experience. However, some data science professionals, myself included, prefer a more balanced approach to the topic. This is the reason why I decided to make this video, which is now available on Safari for your viewing.
Note that this is by no means a complete tutorial on the topic, but it is a good overview of the various aspects of statistics related to data science, along with some programming resources in both Python and Julia, to get you started. Enjoy!
A.I. and ML are often used interchangeably, while many people consider one to be a subset of the other (which one is the bigger set depends on who you ask). However, things may not be as clear-cut as they seem, since the communities of these two fields are not all that related, while there is a sort of rivalry among the hard-core members of each one of them. Why is that though, if A.I. and ML are so similar to each other, enough to confuse even data scientists?
First of all, let’s start with some definitions. A.I. is the group of methods, algorithms, and processes that bring about computer systems that emulate human intelligence, even if the intelligence they usually exhibit is quite different from our own. Also, these systems often take the form of self-sufficient machines, such as robots, as well as agent programs that roam the Internet or cyberspace in general. ML, on the other hand, is the group of methods, algorithms, and processes that bring about computer systems that solve some data analytics problem in an efficient manner, through some training procedure (the learning part of machine learning). The latter can take place with the help of some specific outcomes (aka targets) or without them. Also, the training can take the form of feedback on the system’s predictions, which is like on-the-job training of sorts.
Clearly, there is a close link between ML and data science, since ML systems are designed for this sort of problem. A.I. systems, on the other hand, may tackle different kinds of problems too (e.g. finding the optimal route given some restrictions). So, there is a part of A.I. that is leveraged in data science and a part of A.I. that has nothing to do with our craft. The part of A.I. that is used in data science has a large intersection with ML, mainly through network-based systems, such as ANNs. Lately, Deep Learning networks, which are specialized and more sophisticated kinds of ANNs, have become quite popular and are also part of that intersection between A.I. and ML.
Many people who work in A.I. consider it more of a science than ML, and they are right in a way. Most ML methods are heuristics-based and don’t have much theory behind them, while the ones that are tied to Stats (statistical-ML hybrids) are heavily constrained by the assumptions of the underlying statistical theory. A.I. methods are generally data-driven too, but they are also related to processes found in nature, so they don’t come out of the blue.
Nevertheless, a data scientist who is being professional and pragmatic doesn’t put too much emphasis on the differences between A.I. and ML methods, since he cares more about how they can be applied to solve the problems at hand. So, even if these two families of methods are not the same, nor is one a subset of the other, they are both very useful, if not essential, in practical data science.
Recently a far-reaching scandal broke out, as a reporter exposed a data science company called Cambridge Analytica. According to the information gathered, that company used a dataset harvested via Facebook, enriched with a lot of data from the Facebook graph too, to affect the 2016 presidential election in the USA. It is important to note that the role of that project was not exploratory (e.g. finding insights about the voters); rather, it aimed at steering the voters’ views on a certain candidate, in order to benefit the other candidate, who was the company’s client.
Personally I’m not vested in US politics and don’t have any strong views on the matter, which is why I chose to omit the names of the politicians involved. As a data science professional, however, I find what C.A. did shameful and unethical, on many levels. Examples like this only go to show that, just like everything else in applied science, data science can be used for malicious purposes too, something that every data scientist ought to be aware of and avoid whenever possible.
Also, a topic like this one concerns not just data scientists but anyone working alongside them, since it would be naive to believe that this whole fiasco was the result of a few data science professionals acting on their own. As the corresponding footage shows, the black-hat approach to data analytics was initiated by the company’s head, who was quite forthcoming about what the company was trying to do. That doesn’t make the data scientists working there innocent victims, but at least the responsibility for this dark project is shared among everyone there, not just them. Also, considering that it wasn’t a huge company, it’s quite unlikely that the data scientists weren’t aware of the unethical and immoral agenda their work was serving. However, it is clear that had they not cooperated with this plan, things would at the very least have slowed down.
So, how can we guard ourselves against situations like that of the C.A. scandal, as data science professionals? First of all, we can avoid working for people who don’t have a moral compass and who are looking at how the data products developed can be used to covertly drive certain behaviors that, if exposed, would be punishable. So, if the leaders of a project are shady individuals who don’t mind hurting others in order to make their clients happy, that’s a red flag.
The data itself could be another potential warning sign. If it is collected through unethical means and used in ways that compromise people’s privacy, then that’s a tell-tale sign that there is something fishy going on. Another such sign is the insights discovered through such a project (in this case, the categorization of the people involved into four groups relating to some intimate aspects of their personalities). If we are not comfortable sharing these insights with those people (assuming that there was no NDA in place prohibiting that), because it just feels wrong, then we shouldn’t be digging up those insights to start with.
Finally, if the data products don’t serve the people involved in the data behind these products, even indirectly, then that’s another red flag. The products we create should be something we can talk about openly (without giving away any sensitive know-how behind them, of course), without feeling ashamed or guilty about their purpose.
Naturally, these few suggestions are but the tip of the iceberg of a very large topic related to the Ethics aspect of our profession. I cannot hope to do this topic justice through a blog article, or even a video like the one I made on this topic last year. However, it’s good to remember that we are not powerless against the malicious use of data science by people who are either immoral or amoral, caring only for themselves at the expense of the well-being of others. We may not always be able to stop their agenda, but we can at least identify an unethical project and not contribute to it. Besides, there are many things we can do with data science, so why not focus on the more beneficial ones instead?
When it comes to DS education, nowadays a lot of emphasis is placed on one of two things: the math aspect of it, and the complex algorithms of deep learning systems. Although all this is essential, particularly if you want to be a future-proof data science professional, there is much more to the field than that. Namely, the engineering mentality is something you need to cultivate, since at its core, data science is an engineering discipline. I don’t mean that in a software engineering sense, but rather as a practicality- and efficiency-oriented approach to building a system.
This is largely due to the scaling dimension of a data science metric or model. Unfortunately most data science “educators” fail to elaborate on this point, since they focus mainly on parroting other people’s work, instead of inciting students to gain a deeper understanding of the methods and processes being taught. Also, scaling is the filter that distinguishes a robust algorithm from a mediocre one. As we obtain more and more data, having an algorithm that works well only on a small dataset (or one that requires a great deal of parallelization to yield any benefits) is not sustainable. Of course some people are happy with that, since they have a great deal of resources available, which they are happy to rent out. However, we can often obtain good-enough results with fewer resources, through algorithms that scale better. Even if most people don’t share this fox-like approach to data science, that doesn’t make it less relevant. After all, many people associate methods with the frameworks particular companies offer, rather than understanding the science behind these methods.
Scaling a method up intelligently is the product of three things:
1. having a deep understanding of a method
2. not relying on an abundance of resources to scale it up
3. being creative about the method, making compromises where necessary, to make it more lightweight
That’s where the engineering mentality comes in. The engineer understands the math, but isn’t concerned about having the perfect solution to a problem. Instead, he cares about having a good-enough solution that is reliable and not too costly.
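As one concrete illustration of this trade-off (using scikit-learn here; any comparable toolkit would do), mini-batch k-means gives up a little accuracy in exchange for a large cut in computational cost:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# The "perfect" solution: full k-means over all the data points
full = KMeans(n_clusters=5, n_init=3, random_state=0).fit(X)

# The engineer's compromise: update the centroids on small random batches
light = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3,
                        random_state=0).fit(X)

# The total within-cluster dispersion is usually very close for the two,
# at a fraction of the computational cost for the mini-batch variant
print(full.inertia_, light.inertia_)
```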
This kind of thinking is what drives the development of modern optimization systems, which are an important part of AI. Artificial Intelligence may involve things like deep learning networks, but there is more to it than that. So, if you want to delve more into this field and its numerous applications in data science, cultivating this engineering mentality is the optimal way to go. Perhaps not the absolute best one, but definitely one that works well and is efficient enough!
For the past few months I’ve been working on a tutorial on the data modeling part of the data science process. Recently I finished it, and as of two weeks ago it is available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.