Even if you are not a Bayesian Stats fan, it’s not hard to appreciate this data analytics framework. In fact, it would irresponsible if you were to disregard it without delving into it, at least to some extent. Nevertheless, the fact is that Frequentist Stats (see image above), as well as Machine Learning, are more popular in data science. Let's explore the reasons why this is.
Bayesian Stats relies primarily on the various versions of the Bayes Theorem. In a nutshell, this theorem states that if we have some idea of the a priori probabilities of an event A happening (as well as A not happening), as well as the likelihoods of event B happening given event A happening (as well as A not happening), we can estimate the probability of A given B. This is useful in a variety of cases, particularly when we don't have a great deal of data at our disposal. However, there is something often hard to gauge and it's the Achilles heel of Bayesian Stats. Namely, the a priori probabilities of A (aka the priors) are not always known while when they are, they are usually rough estimates. Of course, this isn't a showstopper for a Bayesian Stats analysis, but it is a weak point that many people are not comfortable with since it introduces an element of subjectivity to the whole analysis.
In Frequentist Stats, there are no priors and the whole framework has an objective approach to things. This may seem a bit far-fetched at times since lots of assumptions are often made but at least most people are comfortable with these assumptions. In Machine Learning, the number of assumptions is significantly smaller as it's a data-driven approach to analytics, making things easier in many ways.
Another matter that makes Bayesian Stats not preferable for many people is the lack of proper education around this subject. Although it predates Frequentist Stats, Bayesian Stats never got enough traction in people's minds. The fact that Frequentist Stats was advocated by a very charismatic individual who was also a great data analyst (Ronald A. Fisher) may have contributed to that. Also, the people who embraced the different types of Statistics at the time augmented the frameworks with certain worldviews, making them more like ideological stances than anything else. As a result, since most people who worked in data analytics at the time were more partial towards Fisher's worldview, it made more sense for them to advocate Frequentist Stats. The fact that Thomas Bayes was a man of the cloth may have dissuaded some people from supporting his Statistics framework.
Finally, Bayesian Stats involves a lot of advanced math when it is applied to continuous variables. As the latter scenario is quite common in most data analytics projects, Bayesian Stats ends up being a fairly esoteric discipline. The latter entails things like Monte Carlo simulations (which although fairly straightforward, they are not as simple as distribution plots and probability tables) and Markov Chains. Also, there are lots of lesser-known distributions used in Bayesian Stats (e.g. Poisson, Beta, and Gamma, just to name a few) that are not as simple or elegant as the Normal (Gaussian) distribution or the Student (t) distribution that are bread and butter for Frequentist Stats. That's not to say that the latter is a walk in the park, but it's more accessible to a beginner in data analytics. As for Machine Learning, contrary to what many people think, it too is fairly accessible, especially if you use a reliable source such as a course, a book, or even an educational video, etc. with a price tag accompanying it.
Summing up, Bayesian Statistics is a great tool that’s worth exploring. If, however, you find that most data analytics professionals don’t share your enthusiasm towards it, don’t be dismayed. This is something natural as the alternative frameworks maintain an advantage over Bayesian Stats.
Lately, there has been a lot of talk about the Corona Virus disease (Covid-19) and Italy is allegedly a hotspot. As my partner lives in Italy and is constantly bombarded by warnings about potential infections and other alarming news like that, I figured it would be appropriate to do some back-of-the-envelop calculations about this situation and put things in perspective a bit. After all, Bologna (the city where she lives) is not officially a "red zone" like Milan and a few other cities in the country.
For this analysis, I used Bayes' Theorem (see formula below) along with some figures I managed to dig up, regarding the virus in the greater Bologna area. The numbers may not be 100% accurate but they are the best I could find, while the assumption made was more than generous.
Namely, I used the latest numbers regarding the spread of the disease as the priors, while regarding the likelihoods (conditional probabilities regarding the test made) I had to use two figures, one from the Journal of Radiology to figure out the false positives rate (5 out of 167 or about 3%, in a particular study) and one for the true positive rate (aka precision), the aforementioned assumption, namely 99%. In reality, this number is bound to be lower but for the sake of argument, let's say that it's correct 99% of the time. Note that certain tests regarding the Covid-19 using CT scans can be as low as 80%, while the test kits available in some countries have even lower precision. For the priors, I used the data reported in the newspaper, namely around 40 for the greater Bologna area. The latter has a population of about 400 000 people (including the suburbs). So, given all that, what are the chances you actually have the virus if you do a test for it the result comes back positive?
Well, by doing the math on Bayes’ theorem, it can take the form:
P(infection | positive) = P(positive | infection) * P(infection) / [P(positive | infection) * P(infection) + P(positive | healthy) * P(healthy)]
As being infected and being healthy are mutually exclusive, we can say that P(healthy) = 1 – P(infection). Doing some more math on this we end up with this slightly more elegant formula:
P(infection | positive) = 1 / [1 + λ (1 / P(infection) – 1)] where λ = P(positive | healthy) / P(positive | infection).
Plugging in all the numbers we end up with: P(infection | positive) = 1 / (1 + 303) = 0.3% (!)
In other words, even if you do a proper test for Covid-19, and the test is positive (i.e. the doctor tells you “you’re infected”) the chances of this being true are about 1 in 300. This is roughly equivalent to rolling a triple 1 using 3 dice (i.e. you roll three dice and the outcome is 1-1-1). Of course, if you don’t test positive, the chances of you having the virus are much lower.
Note that the above analysis is for the city of Bologna and that for other cities you'll need to update the formula with the numbers that apply there. However, even if the scope of this analysis is limited to the greater Bologna area, it goes on to show that this whole situation that plagues Italy is more fear-mongering than anything else. Nevertheless, it is advisable to be mindful of your health as during times of changing weather (and climate), your immune system may need some help to ensure it keeps your body healthy, so anything you do to help it is a plus. Things like exercise, a good diet, exposure to the sun, keeping stress at bay, and maintaining good body hygiene are essential regardless of what pathogens may or may not threaten your well-being. Stay healthy!
(image by Arek Socha, available at pixabay)
Lately, I've been working on the final parts of my latest book, which is contracted for the end of Spring this year. As this is probably going to be my last technical book for the foreseeable future, I'd like to put my best into it, given the available resources of time and energy. This is one of the reasons I haven't been very active on this blog as of late. In this book (whose details I’m going to reveal when it’s in the printing press) I examine various aspects of data science in a quite hands-on way. One of these aspects, which I often talk about with my mentees, is that of scale.
Scaling is very important in data science projects, particularly those involving distance-based metrics. Although the latter may be a bit niche from a modern standpoint where A.I. based systems are often the go-to option, there is still a lot of value in distances as they are usually the prima materia of almost all similarity metrics. Similarity-based systems, aka transductive systems, are quite popular even in this era of A.I. based models. This is particularly the case in clustering problems, whereby both the clustering algorithms and the evaluation metrics (e.g. Silhouette score/width) are based on distances for evaluating cluster affinity. Also, certain dimensionality reduction methods like Principle Components Analysis (PCA) often require a certain kind of scaling to function optimally.
Scaling is not as simple as it may first seem. After all, it greatly depends on the application as well as the data itself (something not everyone is aware of since the way scaling/normalization is treated in data science educational material is somewhat superficial). For example, you can have a fixed range scaling process or a fixed center one. You can even have a fixed range and fixed center one at the same time if you wish, though it's not something you'd normally see anywhere. Fixed scaling is usually in the [0, 1] interval and it involves scaling the data so that its range is constant. The center point of that data (usually measured with the arithmetic mean/average), however, could be distorted. How much so depends on the structure of the data. As for the fixed center scaling, this ensures that the center of the scaled variable is a given value, usually 0. In many cases, the spread of the scaled data is fixed too, usually by setting the standard deviation to 1.
Programmatic methods for performing scaling vary, perhaps more than the Stats educators will have you think. For example, in the fixed range scaling, you could use the min-max normalization (aka 0-1 normalization, a term that shows both limited understanding of the topic and vagueness), or you could use a non-linear function that is also bound by these values. The advantage of the latter is that you can mitigate the effect of any outliers, without having to eradicate them, all through the use of good old-fashioned Math! Naturally, most Stats educators shy away at the mention of the word non-linear since they like to keep things simple (perhaps too simple) so don’t expect to learn about this kind of fixed-range scaling in a Stats book.
All in all, scaling is something worth keeping in mind when dealing with data, particularly when using a distance-based method or a dimensionality reduction process like PCA. Naturally, there is more to the topic than meets the eye, plus as a process, it's not as basic as it may seem through the lens of package documentation or a Stats book. Whatever the case, it's something worth utilizing, always in tandem with other data engineering tools to ensure a better quality data science project.
As mentioned in a previous post, translinearity is a concept describing the fluidity of the linear and the non-linear, as they are combined in a unified framework. However, linear relationships are still valuable, particularly if you want to develop a robust model. It's just that the rigid classification between linear and non-linear is arbitrary and meaningless when it comes to such a model. To clarify this whole matter I started exploring it further and developed an interesting heuristic to measure the level of non-linearity on a scale that's intuitive and useful.
So, let's start with a single feature or variable. How does it fare by itself in terms of linearity and non-linearity? A statistician will probably tell you that this sort of question is meaningless since the indoctrination he/she has received would make it impossible to ask anything that's not within a Stats course's curriculum. However, the question is meaningful even though it's not as useful as the follow-up questions that can ensue. So, depending on the data in that feature, it can be linear, super-linear, or sub-linear, in various degrees. The Index of Non-Linearity (INL) metric gauges that and through the values it takes (ranging from -1 to 1, inclusive) we can assess what a feature is like on its own. Naturally, these scores can be easily shifted by a non-linear operator (e.g. sqrt(x) or exp(x)) while all linear operators (e.g. standard normalization methods) do not affect these scores. Also, at the current implementation of INL, the value of the heuristic is calculated using three reference points in the variable.
Having established that, we can proceed to explore how a feature fares in relation to another variable (e.g. the target variable in a predictive analytics setting). Usually, the feature is used as the independent variable and the other variable as the dependent one, though you can explore the reverse relationship too, using this same heuristic. Interestingly the problem is not as simple now because the two variables need to be viewed in tandem. That's why all the reference points used shift if we change the order of the variables (i.e. the heuristic is not symmetric). Whatever the case, it is still possible to calculate INL with the same idea but taking into account the reference values of both variables. In the current implementation of the heuristic, the values can go a bit off-limits, which is why they are bound artificially to the [-1, 1] range.
Naturally, metrics like INL are just the tip of the iceberg in this deep concept. However, the existence of INL illustrates that it is possible to devise heuristics for every concept in data science, as long as we are open to the possibilities the data world offers. Not everything has been analyzed through Stats, which despite its indisputable value as a data science tool, it is still just one framework, a singular way of looking at things. Fortunately, the data-scapes of data science can be viewed in many more ways leading to intriguing possibilities worth exploring.
Everyone in data science (and even beyond data science to some extent) is familiar with the process of sampling. It’s such a fundamental method in data analytics that it’s hard to be unaware of it. The fact that’s so intuitive as well makes it even easier to comprehend and apply. Besides, in the world of Big Data, sampling seems to be not only useful but also necessary! What about data summarization though? How does that fit in data science and how does it differ from sampling?
Both data summarization and sampling aim to reduce the number of data points in the data set. However, they go about it in very different ways. For starters, sampling usually picks the data points randomly while in some cases, it takes into account an additional variable (usually the target variable). The latter is the case of stratified sampling, something essential if you want to perform proper K-fold cross-validation for a classification problem. Data summarization, on the other hand, creates new data points that aim to contain the same information as the original dataset, or at least retain as much of it as possible.
Another important difference between the two methodologies is that data summarization tends to be deterministic, while sampling is highly stochastic. This means that you cannot use data summarization instead of sampling, at least not repeatedly as in the case of K-fold cross-validation. Otherwise, you’ll end up with the same results every time, something that doesn’t help with the validation of the models at hand! Perhaps that’s one of the reasons why data summarization is not so widely known in the data science community, where model validation is a key focus of data science work.
What’s more, if sampling is done properly, it can maintain the relationships among the variables at hand (obviously this would entail the use of some heuristics since random sampling alone won’t cut it). Data summarization, on the other hand, doesn't do that so well, partly because it focuses on the most important aspects of the dataset, discarding everything else. This results in skewing the variable relationships a bit, much like a PCA method changes the data completely when it is applied. So, if you care about maintaining these variable correlations, data summarization is not the way to go.
Finally, due to the nature of the data involved, data summarization could be used for data anonymization and even data generation. Sampling, however, wouldn't work so well for these sorts of tasks, even though it could be used for data generation if the sampling is free of biases (something which can also be attained if certain heuristics are applied). All this illustrates the point that although these two methods are quite different, they are also applicable in different use cases so they don’t exactly compete with each other. It’s up to the discerning data scientist to figure out when to use which, adding value to the project at hand.
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-Solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is a quite reasonable one and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything of this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, opting to find the best way to raise some revenue for the site all while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
So, the 7th quiz video I've created is finally online on O'Reilly. This is the longest one so far spanning over 51 minutes, meaning there are lots of explanations for the various questions. It covers a bunch of topics, such as A/B testing, ANOVA, and various statistical tests. I put a lot of thought in this, much like you'd put a lot of thought in designing a data science experiment. Hopefully, you'll find it as useful and enjoyable as I did.
Note that just like other videos published on O'Reilly, you'll need to have an active account (even if it's a trial one), in order to view it in its entirety. As a bonus, you'll be able to view other videos as well as books available on that platform. Enjoy!
Dimensionality reduction has been a standard methodology to deal with datasets that have a lot of features, more than a typical model can handle effectively. Reducing the number of features can also save time and storage space, while when it comes to sensitive data it can be a big plus as it enables anonymity in the people involved. What’s more, in some cases, a reduced dimensionality dataset can be more effective as there is less noise in it. However, conventional dimensionality reduction methods don’t always do the trick due to the inherent limitations they have. For example, PCA only considers linear relationships among the variables and a linear combination of features, as a solution.
Of course, other people are not sitting idle when it comes to this issue. There are several dimensionality reduction options that are being pursued, the most interesting of which is autoencoders. This AI-based method involves a data-driven approach to figuring out the nature of the data and creating new variables that can represent the underlying signal, by minimizing the error. The issue with this is that it often requires a lot of data and some specialized know-how in order to configure optimally. Also, this whole process may be fairly slow, due to the large number of computations involved.
An alternative approach has to do with feature fusion in a non-AI way. The idea is to maintain transparency to the extent this is possible, while at the same time optimize the whole process in terms of speed. The use of multiple operators, some linear and some non-linear, is essential, while the option of dropping useless features is also very useful. Naturally, this whole process would be more effective in the presence of a target variable, but it should be able to work without it, for better applicability. Whatever the case, the use of a metric able to handle non-linear correlations is paramount since the conventional correlation metric used leaves a lot to be desired.
Based on all this, it’s clear that the dimensionality reduction area is still capable of enhancements. Despite the great work that has been done already, there is still room for new methods that can address the limitations the existing methods have, which aren’t going away any time soon. Perhaps it would be best to explore this methodology of data engineering more, instead of focusing the latest and greatest system, which although intriguing, may sacrifice too much (e.g. transparency) in the name of accuracy, a trade-off that may no longer be cost-effective. Something to think about...
(Image by lazyprogrammer.me)
PCA has attracted a lot of questions among all of my mentees over the years, so I decided to make a fairly in-depth video on the topic. Unlike other education material on PCA, this one is light on the math, while there is a lot of emphasis on the concepts as well as how they apply to a data scientist's work. You can check out the video on Safari here.
Note that in order to view the video in its entirety you'll need a subscription to the Safari platform. Cheers!
Lately it’s hard to find someone who is a legit data scientist and yet doesn’t talk about Stats as if it’s a new religion or something. Don’t get me wrong; I find Stats a very useful tool in data analytics, especially data science. However, there are other, usually most suitable options out there to have in one’s data science toolbox.
First of all, Statistics is the state-of-the-art approach to data modeling, if you live in the mid 20th century. In our time, Stats, particularly frequentist Stats, is greatly outdated and many of the assumptions it makes about the data don’t make any sense. Also, transforming the data so that it fits the assumptions many Stats models make, is a time-consuming process which may or may not be worth the trouble. Of course if you know nothing else, or have trust issues with novel modeling options, then Stats may be the best option for you. In this case, however, it is best to brand yourself as a Stats professional instead of a data scientist, since the latter implies that you do more than just Stats.
In addition, pretty much all of the metrics used in Stats can be improved heavily by negating the normality assumption. The more data I come across, the more certain I am that this assumption may make sense in some cases, but in the majority of cases it doesn’t hold. So, using metrics that have this assumption embedded in them doesn’t really help anyone. What’s more, all this framework inevitably shapes one’s mindset and so if you get used to the unreasonable assumptions Stats usually makes about the data, you may not be able to think of the data in a different way.
Moreover, with the advent of A.I., especially the A.I. that’s directly applicable to data science, the data transformation and modeling options available to data scientists have increased dramatically. So, relying on Stats is more of a preference rather than a necessity. Besides, it’s extremely unlikely that a Stats model will be able to outperform an A.I. one when the latter is well configured.
Finally, there are other new data analytics methods waiting to be discovered and used in data science. Heuristics have made a comeback and are more and more popular in data science research, especially when it comes to complex datasets. So, sticking to Stats when there is a plethora of possibilities out there that can tackle a problem more effectively is just depressing.
Having said all that, Stats is a useful subject to learn, as it can aid one’s learning of the data science craft. Much like learning basic Mechanics can be useful if you want to being a Physics professional, learning Stats can be quite useful. Sticking to it and thinking of it as gospel, however, is not. That’s why after learning about it, it’s best to seek to expand your understanding of data analytics through delving into other frameworks, such as Machine Learning, A.I. based systems, and heuristics. Stats is just one of the tools available in the data scientist's toolbox...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.