Although the debate between Frequentist and Bayesian statisticians sometimes takes a more comical turn (XKCD strip), it is still important for a data scientist to know a few things about Bayesian Stats. Of course, purists of the craft will argue that Frequentist Stats will suffice but if you want to stand out of the crowd, it would definitely help going beyond the beaten path, when it comes to data analytics know-how.
This video I made recently highlights the key elements of Bayesian Stats, focusing on the concepts that although fairly straight-forward, may be obscure to the newcomer. Also, without disregarding the invaluable contribution of Frequentist Stats to data science, this video explores how the two differ and how Bayesian Stats has a lot in common with other, more modern, data analytics frameworks. Check it out when you get the chance!
Note that a subscription to the Safari platform is necessary in order to view the video in its entirety.
Dichotomy: a binary separation of a set into two mutually exclusive subsets
Data Science: the interdisciplinary field for analyzing data, building models, and bringing about insights and/or data products, which add value to an organization. Data science makes use of various frameworks and methodologies, including (but not limited) to Stats, ML, and A.I.
After getting these pesky definitions out of the way, in an effort to mitigate the chances of misunderstandings, let’s get to the gist of this fairly controversial topic. For starters, all this information here is for educational purposes and shouldn’t be taken as gospel since in data science there is plenty of room for experimentation and someone adept in it doesn’t need to abide to this taxonomy or any rules deriving from it.
The inaccurate dichotomy issues in data science, however, can be quite problematic for newcomers to the field as well as for managers involved in data related processes. After all, in order to learn about this field a considerable amount of time is required, something that is not within the temporal budget of most people involved in data science, particularly those who are starting off now. So, let’s get some misconceptions out of the way so that your understanding of the field is not contaminated by the garbage that roams the web, especially the social media, when it comes to data science.
Namely, there are (mis-)infographics out there that state that Stats and ML are mutually exclusive, or that there is no overlap between non-AI methods and ML. In other words, ML is part of AI, something that is considered blasphemy in the ML community. The reason is simple: ML as a field was developed independently of AI and has its own applications. AI can greatly facilitate ML through its various network-based models (among other systems), but ML stands on its own. After all, many ML models are not AI related, even if AI can be used to improve them in various ways. So, there is an overlap between ML and AI, but there are non-AI models that are under the ML umbrella.
Same goes with Statistics. This proud sub-field of Mathematics has been the main framework for data analytics for a long time before ML started to appear, revolting against the model-based approach dictated by Stats. However, things aren’t that clear-cut. Even if the majority of Stats models are model-based, there are also models that are hybrid, having elements of Stats and ML. Take Bayesian Networks for example, or some variants of the Naive Bayes model. Although these models are inherently Statistical, they have enough elements of ML that they can be considered ML models too. In other words, they lie on the nexus of the two sets of methods.
What about Stats and AI? Well, Variational AutoEncoders (VAEs) are an AI-based model for dimensionality reduction and data generation. So, there is no doubt that it lies within the AI set. However, if you look under the hood you’ll see that it makes use of Stats for the figuring out what the data generated by it would be like. Specifically, it makes use of distributions, a fundamentally statistical concept, for the understanding and the generation of the data involved. So, it wouldn’t be far-fetched to put VAEs in the Stats set too.
From all this I hope it becomes clear that the taxonomy of data science models isn’t that rigid as it may seem. If there was a time when this rigid separation of models made sense, this time is now gone as hybrid systems are becoming more and more popular, while at the same time the ML field expands in various directions outside AI. So, I’d recommend you take those (mis-)infographics with a pinch of salt. After all, most likely they were created by some overworked employee (perhaps an intern) with a limited understanding of data science.
A famous scientist from the Quantum Physics school of thought once said “asking the right question is more than halfway towards finding the answer.” Although it’s been years since I read this quote (which I may be paraphrasing, by the way), it still echoes a deep truth and helps guide my (non-academic) research in the data science and A.I. fields. So, I few weeks ago I put forward the question “what would a statistical framework framed around possibilities be like?”
At first glance, such a question may seem nonsensical since from an early age we’ve all be taught the core aspects of Stats and how it’s all about probabilities. There is no doubt that the probabilistic approach to modeling uncertainty has yielded a lot of fruits as the field grew, but all developments of Statistical methods were bound by the limitations of the assumptions made, mirrored by the various distributions used. In other words, if you want results with conventional Stats, you’ve got to use this or the other distribution and keep in mind that if the data you have doesn’t follow the distribution assumed, the results may not be reliable. What if the field of Stats was void of such restrictions by assuming a membership function instead of a distribution, to describe the data at hand?
I’m not going to describe in length where this rabbit hole leads, but suffice to say that the preliminary results of a framework based on this alternative approach exceeded my expectations. Also, there is no Stats process that I looked at which could not be replicated with the possibilistic approach. What’s more, since the possibilistic approach to data analytics is one of the oldest forms of A.I., it is sensible to say that such a statistical framework would be in essence AI-based, though not related to deep learning, since that’s a completely different approach to A.I. that has its own set of benefits. Nevertheless, I found that having a statistical framework that borrows an A.I. concept in its core, can provide an interesting way to bridge the gap between Stats-based data analytics and modern / A.I. based.
What’s even more interesting is that this can be a two-way street, with A.I. also being able to benefit from such a nexus between the two fields. After all, one of the biggest pain points of modern A.I. is the lack of transparency, something that’s a freebie when it comes to Stats modeling. So, an A.I. system that has elements of Stats at its core may indeed be a transparent one. However, this idea is still highly experimental, so it would be best to not discuss it further here.
Whatever the case, I have no doubt that the possibilistic approach to data has a lot of merit and hasn’t been explored enough. So, it is possible that it has a role to play in more modern data analytics systems. The question is, are you willing to accept this possibility?
Before someone says “yes, of course; you just need to apply a non-linear transformation to one of the variables!”, let me rephrase: can we measure a non-linear relationship between two variables, without any transformations whatsoever? In other words, is there a heuristic metric that can facilitate the task of establishing whether two variables are linked in some fashion, without any data engineering from our part?
The answer is “yes, of course” again. However, the relationship has to be monotonous for this to work. In other words, there needs to be a 1-1 relationship between the values of the two variables. Otherwise, it may not appear as strong, due to the nature of non-linearity.
So, if we have two variables x and y, and y is something like x^10 + exp(x), that’s a relationship that is clearly non-linear, but also monotonous. Also, the Pearson correlation of the two variables in this case is not particularly strong (for the variables tested, it was about 0.67). If it were measured by a different correlation metric, however, like a custom-built one I’ve recently developed, the relationship would be somewhat stronger (for these variables, it would be around 0.75) while Kendall's ranked correlation coefficient would produce a great result too (1.00 for these variables).
In a different scenario, where z = 1 / x, for example, the results of the correlation metrics differ more. Pearson’s correlation in this case would be something like -0.16, while the custom-made metric would yield something around -0.69. Also, Kendall’s coefficient would be -1.00.
Although the effect is not always pronounced, in cases like this one, a different metric can make the difference between a strong correlation and a not-so-strong one, affecting our decisions about the variables.
Bottom line, even if the Pearson correlation coefficient is the most popular method for measuring the relationship between two variables, it’s not the best choice when it comes to non-linear relationships. That’s why different metrics need to be used for evaluating the relationship between two variables, particularly if it’s a non-linear one.
Although I’ve talked about dimensionality reduction for data science in the corresponding video on Safari, covering various angles of the topic, I was never fully content with the methodologies out there. After all, all the good ones are fairly sophisticated, while all the easier ones are quite limited. Could there be a different (better) way of performing dimensionality reduction in a dataset? If so, what issue would such a method tackle?
First of all, conventional dimensionality reduction methods tend to come from Statistics. That’s great if the dataset is fairly simple, but methods like PCA focus on the linear relationships among the features, which although it’s a good place to start, it doesn’t cover all the bases. For example, what if features F1 and F2 have a non-linear relationship? Will PCA be able to spot that? Probably not, unless there is a strong linear component to it. Also, if F1 and F2 follow some strange distribution, the PCA method won’t work very well either.
What's more, what if you want to have meta-features that are independent to each other, yet still explain a lot of variance? Clearly PCA won’t always give you this sort of results, since for complex datasets the PCs will end up being tangled themselves. Also, ICA, a method designed for independent components, is not as easy to use since it’s hard to figure out exactly where to stop when it comes to selecting meta-features.
In addition, what’s the deal with outliers in the features? Surely they affect the end result, by changing the whole landscape of the features, breaking the whole scale equilibrium at times. Well, that’s one of the weak point of PCA and similar dimensionality reduction methods, since they require some data engineering before they can do their magic.
Finally, how much does each one of the original features contribute to the meta-features you end up with after using PCA? That’s a question that few people can answer although the answer is right there in front of them. Also, such a piece of information may be useful in evaluating the original features or providing some explanation of how much they are worth in terms of predictive potential, after the meta-features are used in a model.
All of these issues and more can be tackled by using a new approach to dimensionality reduction, one that is based on a new paradigm (the same one that can tackled the clustering issues mentioned in the previous post). Also, even though the new approach doesn’t use a network architecture, it can still be considered a type of A.I. as there is some kind of optimization involved. As for the specifics of the new approach, that’s something to be discussed in another post, when the time is right...
A/B testing is a crucial methodology / application in the data science field. Although it mainly relies on Statistics, it has a remained quite relevant in this machine learning and AI oriented era of our field. It's no coincidence that in Thinkful that's one of the first things data science students learn, once they get comfortable with descriptive Stats and basic data manipulation. So, I decided to do a video on this topic to help those interested in learning about it get a good perspective of it and understand better its relationship with Hypothesis Testing. It is my hope that this video can be a good supplement to one's learning on the subject. Enjoy!
Even though this topic may be a bit polarizing, especially among people who are new to data science, knowing more about it can be very useful, particularly if you value a sense of perspective more than a good grade in some data science crash course. The latter is bound to overemphasize either Stats or AI, depending on the instructor's knowledge and experience. However, some data science professionals, myself included, prefer a more balanced approach on the topic. This is the reason why I decided to make this video, which is now available on Safari for your viewing.
Note that this is by no means a complete tutorial on the topic, but it is a good overview of the various aspects of statistics related to data science, along with some programming resources in both Python and Julia, to get you started. Enjoy!
After investigating this topic quite a bit, as I was looking into A.I. stuff, I decided to create a video on it. To make it more complete, I included other methods too, such as Statistics-based and heuristics-based ones. Despite the excessive amount of content I put together into this project (the script was over 4000 words), I managed to keep the video at a manageable length (a bit less than half an hour). Check it out on Safari when you have some time!
I've mentioned both in the DS Modeling Tutorial and in another article of mine the importance of discretization / binning of a continuous variable, as a strategy for turning it into a feature, to be used in a data model. However, how meaningful and information-rich the resulting categorical feature is going to be depends on the thresholds we use. In this post I'd like to share with you a strategy that I've come up with that works well in doing just that.
First of all, we need to make sure we have a potent method for calculating the density of a data point. I'm not talking about probability density though, since the latter is a statistical concept that has more to do with the mathematical form of a distribution than the actual density observed. The actual density is what we would measure if we were to look at the data itself and although it's quite straight-forward, it's not as easy to do at scale. That's why I first developed a very simple (almost simplistic) method for approximating density using a sampling of sorts, rather than looking at each individual element in the variable.
Afterwards, we just need to figure out the point of least density, that's not an extreme of the variable. In other words, identity of a local minimum in the density distribution, a fairly easy task that's also computationally cheap. Of course it's good to have a threshold too, to distinguish between this point being an actual low-density point and one that could be due to chance. If the density of that point is below this threshold, we can take it to be a point of dissection for the variable, effectively binarizing it.
Beyond that, we can repeat the same process recursively, for the two partitions of the variable. This way, we can end up with 3, 4, or even 100 partitions at the end of the process. This is another reason why this aforementioned threshold is very important. After all, not all partitions would be binarizable in a meaningful way. Also, it would be a good idea to have a limit to how many partitions overall we allow, so that we don't end up with a categorical variable having 1000 unique values either!
This optimal discretization / binning process is very simple and robust, resulting into a simpler form of the original variable, one that can be broken down to a set of binary features afterwards, if needed. This can also be useful in identifying potential outliers and being able to use them (as separate values in the new feature) instead of discarding them. The method is made even faster through its implementation in Julia, which once again proved itself as a great DS tool.
The idea of sampling is fundamental in data science and even though it is taught in every data science book or course out there, there is still a lot to be learned about it. The reason is that sampling is a very deep topic and just like every other data-related topic out there, conventional Statistics fails to do it justice. The reason is simple: good quality samples come about by obtaining an unbiased representation of a population and this is rarely the case from strictly random samples. Also, the fact that Statistics doesn’t offer any metric whatsoever regarding bias in a sample, doesn’t help the whole situation.
The Index of Bias (IB) of a Sample
There are two distinct aspects of a sample, its bias and its diversity. Here we’ll explore the former, as it is expressed in the two fundamental aspects of a distribution, its central point and its spread. For these aspects we’ll use two robust and fairly stable metrics, the median and the inter-quartile range, respectively. The deviation of a sample in terms of these metrics, with each deviation normalized based on the maximum deviation possible for the given data, yields two metrics, one of the central point and one for the spread. Each metric takes values between 0 and 1, inclusive. The average of these metrics is defined as the index of bias of a sample and takes values in the [0, 1] interval too. Note that the index of bias is always in relation to the original dataset we take the sample from.
Although the above definition applies to one-dimensional data only, it can be generalized to n-dimensional data too. For example, we can define the index of bias of a dataset comprising of d dimensions (features) as the arithmetic mean of the index of bias of each one of its features.
IB Scores for Various Samples
Strictly random samples tend to have a fairly high IB score, considering that we expect them to be unbiased. That’s not to say that they are always very biased, but they are definitely in need of improvement. Naturally, if the data we are sampling is multi-dimensional, the chances of a bias are higher, resulting to an overall biased sample.
Samples that are engineered with IB in mind are more likely to be unbiased in that sense. Naturally, this takes a bit of effort. Still, given enough random samples, it is possible to get a good enough sample that is unbiased based on this metric. In the attached file I include the IB scores of various samples, for both a random sampling process (first column) and a more meticulous one that aims to obtain a less biased sample (second column). Note that the latter did not use the IB metric in its algorithm, though a variant of it that makes use of that metric is also available (not free though). Also, you don’t need to be an expert in statistical tests to see that the second sampling method is consistently better than the first one. Finally, I did other tests on different data, and in every case, the results were very similar.
Hopefully, this small experiment goes on to show how sampling is not a trivial problem as it is made out to be by those who follow some old-fashioned paradigm for data analytics. Despite its simplicity, sampling has a lot of facets that need to be explored and understood, before it can be leveraged as a data science technique. After all, what good is an advanced model if the data it is trained on is biased? I believe we owe it to ourselves as data scientists to pay attention to every part of the data science process, including the less interesting parts, such as sampling.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.