Scientific knowledge is a greatly misunderstood matter, especially today. As we are bombarded with scientific innovations regularly and see scientists getting featured in various media, or even have films made about them (e.g. the classic "A Beautiful Mind" and "The Theory of Everything"), we may conclude that science is easy or that anyone with enough determination and some intelligence can make it in the scientific world. However, science is anything but easy, while someone's mental prowess and willpower although they play an important role in all this, are not the best predictors of their success. Sometimes, people are just at the right place at the right time (as in the case of Einstein). In any case, to get a better understanding of all this, let's break it down to its fundamental components through a highlevel model of sorts and see how they come into place to make scientific knowledge come about. First of all, scientific knowledge is the knowledge that comes about through the scientific method, preexisting knowledge or information (e.g. through observations or raw data) and concerning a particular problem. The latter may be something concrete (e.g. a machine that can transform chemical energy into work) or abstract (e.g. a mathematical model that explains how two variables relate to each other and how one of them can act as a predictor for the other). A problem may also be a weakness of an existing theory or model, that requires further understanding before it can more widely useful. Scientific knowledge has three primary aspects: research, fidelity, and application. Research has to do with the integration of information into a theory and/or new knowledge that supplements existing knowledge. This doesn't have to be groundbreaking since even a metaanalysis on a subject can provide crucial insight in understanding the problem at hand from a more holistic perspective. Perhaps that's why the first steps in a scientific project involve exploration and a critical analysis of the literature. In any case, it's hard to imagine scientific knowledge without research at its core since otherwise, it can become static, dogmatic and even superficial. The variety of different approaches to research and the value people place on it in all scientific institutions (including the nonacademic ones) attests to that. Fidelity has to do with developing confidence in something, particularly something new or different from what's already known. If the product of one's research doesn't carry confidence with it and it's just speculation or a thin interpretation of the data, it doesn't provide much usefulness. A scientist needs to attack the new knowledge with everything she's got to ensure that it holds water. That's why experimentation is so important as well as in some cases, peer review. The latter is very useful though not always essential since if the experiments are carried out properly and the scientist has no vested interest in the new knowledge (i.e. his intentions are pure), if there are issues with it they will surface sooner rather than later. If the new knowledge remains firm, the fidelity of it will grow along with the confidence of the scientist in what it can do. Naturally, this confidence level will never reach 100% since in science there is always room for disproving something. This brings us to the next aspect: application. This involves the application of this new knowledge into a realworld problem or some other situation that's somewhat different from the original one. It has to do with making predictions using the new theory, predictions that are of some value to someone beyond the scientist. "Application is the ultimate and most sacred form of theory," a Greek philosopher once wrote. Even if he wasn't a scientist he must have been on to something. After all, most of today's new scientific output is geared towards applications of one form or another. That's not to say that a purely theoretical kind of research is not of any value. However, purely theoretical research is still knowledge in progress. Once this research finds its way to robustness (through fidelity) and a model or a physical system that applies itself to solve a real problem, then it will have completed its evolutionary journey. Naturally, research, fidelity, and application are not isolated from each other. There is a great deal of interaction between them, partly because they are part of an organic whole and partly because one stems from the other. Without research, we cannot talk about fidelity, nor an application. Technology (which is linked to the latter) doesn’t come out of thin air. Also, no matter how much research we do, without testing the new knowledge to ensure a level of fidelity in it, we cannot use this knowledge elsewhere without contaminating the existing pool of knowledge. What’s more, if an application doesn’t work well enough we often need to go back to the research stage to refine the underlying knowledge. Finally, sometimes it is the application that drives both research and fidelity, giving everything an end goal and a quality standard. Otherwise, we could be researching for research’s sake without ever producing any new knowledge that can benefit others. Because of all this, it makes sense to connect three data points to form a triangle. We can also draw the circle around this triangle as a way to picture another important aspect of scientific knowledge: its scope (see figure below). No scientific theory aims to explain everything, except perhaps some ambitious projects in Physics that aim to unify all existing theories regarding the universe (though none of them have been successful while their chances of success are a highly debatable matter). That’s where scope fits in to put all this into perspective. In data science scope is beautifully explained in the "No Free Lunch Theorem" which goes on to say that if a model or algorithm has an edge over the alternatives for a particular kind of datasets, it means that it is bound to be weaker against these same alternatives for a different kind of dataset. In other words, no model outperforms all its alternatives always, just like there is no car out there that's better than all other cars for all sorts of terrain. Naturally, as the scientific knowledge grows for a particular domain, new knowledge may come about that has a larger scope (e.g. a theory that explains more of the observed phenomena and the corresponding data). Still, most new knowledge tends to have a very specific scope that although small in relation to the whole domain, it is still useful, as there is value in niche systems (think of a pickup truck that although a bit specialized, it addresses certain use cases very effectively and efficiently). The aforementioned aspects of scientific knowledge are the basis of the highlevel model proposed here as an effort to describe it. Naturally, this is just the beginning, as other factors come into play once we examine things from a larger time frame. This, however, is something that deserves its own article, so stay tuned...
0 Comments
Even if you are not a Bayesian Stats fan, it’s not hard to appreciate this data analytics framework. In fact, it would irresponsible if you were to disregard it without delving into it, at least to some extent. Nevertheless, the fact is that Frequentist Stats (see image above), as well as Machine Learning, are more popular in data science. Let's explore the reasons why this is. Bayesian Stats relies primarily on the various versions of the Bayes Theorem. In a nutshell, this theorem states that if we have some idea of the a priori probabilities of an event A happening (as well as A not happening), as well as the likelihoods of event B happening given event A happening (as well as A not happening), we can estimate the probability of A given B. This is useful in a variety of cases, particularly when we don't have a great deal of data at our disposal. However, there is something often hard to gauge and it's the Achilles heel of Bayesian Stats. Namely, the a priori probabilities of A (aka the priors) are not always known while when they are, they are usually rough estimates. Of course, this isn't a showstopper for a Bayesian Stats analysis, but it is a weak point that many people are not comfortable with since it introduces an element of subjectivity to the whole analysis. In Frequentist Stats, there are no priors and the whole framework has an objective approach to things. This may seem a bit farfetched at times since lots of assumptions are often made but at least most people are comfortable with these assumptions. In Machine Learning, the number of assumptions is significantly smaller as it's a datadriven approach to analytics, making things easier in many ways. Another matter that makes Bayesian Stats not preferable for many people is the lack of proper education around this subject. Although it predates Frequentist Stats, Bayesian Stats never got enough traction in people's minds. The fact that Frequentist Stats was advocated by a very charismatic individual who was also a great data analyst (Ronald A. Fisher) may have contributed to that. Also, the people who embraced the different types of Statistics at the time augmented the frameworks with certain worldviews, making them more like ideological stances than anything else. As a result, since most people who worked in data analytics at the time were more partial towards Fisher's worldview, it made more sense for them to advocate Frequentist Stats. The fact that Thomas Bayes was a man of the cloth may have dissuaded some people from supporting his Statistics framework. Finally, Bayesian Stats involves a lot of advanced math when it is applied to continuous variables. As the latter scenario is quite common in most data analytics projects, Bayesian Stats ends up being a fairly esoteric discipline. The latter entails things like Monte Carlo simulations (which although fairly straightforward, they are not as simple as distribution plots and probability tables) and Markov Chains. Also, there are lots of lesserknown distributions used in Bayesian Stats (e.g. Poisson, Beta, and Gamma, just to name a few) that are not as simple or elegant as the Normal (Gaussian) distribution or the Student (t) distribution that are bread and butter for Frequentist Stats. That's not to say that the latter is a walk in the park, but it's more accessible to a beginner in data analytics. As for Machine Learning, contrary to what many people think, it too is fairly accessible, especially if you use a reliable source such as a course, a book, or even an educational video, etc. with a price tag accompanying it. Summing up, Bayesian Statistics is a great tool that’s worth exploring. If, however, you find that most data analytics professionals don’t share your enthusiasm towards it, don’t be dismayed. This is something natural as the alternative frameworks maintain an advantage over Bayesian Stats. Lately, there has been a lot of talk about the Corona Virus disease (Covid19) and Italy is allegedly a hotspot. As my partner lives in Italy and is constantly bombarded by warnings about potential infections and other alarming news like that, I figured it would be appropriate to do some backoftheenvelop calculations about this situation and put things in perspective a bit. After all, Bologna (the city where she lives) is not officially a "red zone" like Milan and a few other cities in the country. For this analysis, I used Bayes' Theorem (see formula below) along with some figures I managed to dig up, regarding the virus in the greater Bologna area. The numbers may not be 100% accurate but they are the best I could find, while the assumption made was more than generous. Namely, I used the latest numbers regarding the spread of the disease as the priors, while regarding the likelihoods (conditional probabilities regarding the test made) I had to use two figures, one from the Journal of Radiology to figure out the false positives rate (5 out of 167 or about 3%, in a particular study) and one for the true positive rate (aka precision), the aforementioned assumption, namely 99%. In reality, this number is bound to be lower but for the sake of argument, let's say that it's correct 99% of the time. Note that certain tests regarding the Covid19 using CT scans can be as low as 80%, while the test kits available in some countries have even lower precision. For the priors, I used the data reported in the newspaper, namely around 40 for the greater Bologna area. The latter has a population of about 400 000 people (including the suburbs). So, given all that, what are the chances you actually have the virus if you do a test for it the result comes back positive?
Well, by doing the math on Bayes’ theorem, it can take the form: P(infection  positive) = P(positive  infection) * P(infection) / [P(positive  infection) * P(infection) + P(positive  healthy) * P(healthy)] As being infected and being healthy are mutually exclusive, we can say that P(healthy) = 1 – P(infection). Doing some more math on this we end up with this slightly more elegant formula: P(infection  positive) = 1 / [1 + λ (1 / P(infection) – 1)] where λ = P(positive  healthy) / P(positive  infection). Plugging in all the numbers we end up with: P(infection  positive) = 1 / (1 + 303) = 0.3% (!) In other words, even if you do a proper test for Covid19, and the test is positive (i.e. the doctor tells you “you’re infected”) the chances of this being true are about 1 in 300. This is roughly equivalent to rolling a triple 1 using 3 dice (i.e. you roll three dice and the outcome is 111). Of course, if you don’t test positive, the chances of you having the virus are much lower. Note that the above analysis is for the city of Bologna and that for other cities you'll need to update the formula with the numbers that apply there. However, even if the scope of this analysis is limited to the greater Bologna area, it goes on to show that this whole situation that plagues Italy is more fearmongering than anything else. Nevertheless, it is advisable to be mindful of your health as during times of changing weather (and climate), your immune system may need some help to ensure it keeps your body healthy, so anything you do to help it is a plus. Things like exercise, a good diet, exposure to the sun, keeping stress at bay, and maintaining good body hygiene are essential regardless of what pathogens may or may not threaten your wellbeing. Stay healthy! What’s a Transductive Model?
A transductive model is a predictive analytics model that makes use of distances or similarities. Contrary to inference models that make use of induction and deduction to make their predictions, transductive models tend to be direct. Oftentimes, they don’t even have a training phase in the sense that the model “learns” as it performs its predictions on the testing set. Transductive models are generally under the machine learning umbrella and so far they have always been opaque (black boxes). What’s Transparency in a Predictive Analytics Model? Transparency is an umbrella term for anything that lends itself to a clear understanding of how it makes its predictions and/or how to interpret its results. Statistical models boast transparency since they are simple enough to understand and explain (but not simplistic). Transparency is valued greatly particularly when it comes to business decisions that use the outputs of a predictive model. For example, if you decide to let an employee go, you want to be able to explain why, be it to your manager, to your team, or the employee himself. Transparent kNN? Transparent kNN sounds like an oxymoron, partly because the basic algorithm itself is a moron. It's very hard to think of a simpler and more basic algorithm in machine learning. This, however, hasn't stopped people from using it again and again due to the high speed it exhibits, particularly in smaller datasets. Still, kNN has been a black box so far, despite its many variants, some of which are ingenious indeed. Lately, I've been experimenting with distances and on how they can be broken down into their fundamental components. As a result, I managed to develop a method for a distance metric that is transparent by design. By employing this same metric on the kNN model, and by applying some tweaks in various parts of it, the transparent version of kNN came about. In particular, this transparent kNN model yields not only its predictions about the data at hand but also a confidence metric (akin to a probability score for each one of its predictions) and a weight matrix consisting of the weight each feature has in each one of its predictions. Naturally, as kNN is a model used in both classification and regression, all of the above are available in either one of its modalities. On top of that, the system can identify what modality to use based on the target variable of the training set. What’s Next? For now, I’ll probably continue with other, more useful matters, such as feature fusion. After all, just like most machine learning models, kNN is at the mercy of the features it is given. If I were in academic research, I’d probably write a series of papers on this topic, but as I work solo on these endeavors, I need to prioritize. However, for anyone interested in learning more about this, I’m happy to reply to any queries through this blog. Cheers! For over 2 decades there is a puzzle game I've played from time to time, usually to pass the time creatively or to challenge myself in algorithm development. This game, which I was taught by a friend, didn't have a name and I never managed to find it elsewhere so I call it Numgame (as it involves numbers and it's a game). Over the years, I managed to solve many of its levels though I never got an algorithm for it, until now. The game involves a square grid, originally a 10by10 one. The simplest grid that's solvable is the 5by5 one. The object of the game is to fill the grid with numbers, starting from 1 and going all the way to n^2, where n is the size of the grid, which can be any number larger than 4 (grids of this size or lower are not solvable). To fill the grid, you can "move" horizontally, vertically and diagonally, as long as the cell you go to is empty. When moving horizontally or vertically you need to skip 2 squares, while when you move diagonally you need to skip 1. Naturally, as you progress, getting to the remaining empty squares becomes increasingly hard. That's why you need to have a strategy if you are to finish the game successfully. Naturally, not all starting positions yield a successful result. Although more often than not you'd start from a corner, you may choose to start from any other square in the grid. That's useful, considering that some grids are just not solvable if you start from a corner (see image below; empty cells are marked as zeros) Before we look at the solution I've come across, try to solve a grid on your own and think about a potential algorithm to solve any grid. At the very least, you'll gain an appreciation of the solution afterward. Anyway, the key to solving the Numgame levels is to use a heuristic that will help you assess each move. In other words, you'll need to figure out a score that discerns between good and bad positions. The latter result from the various moves. So, for each cell in the grid, you can count how many legitimate ways are there for accessing it (i.e. ways complying with the aforementioned rules). You can store these numbers in a matrix. Then, you can filter out the cells that have been occupied already, since we won't be revisiting them anyway. This leaves us with a list of numbers corresponding to the number of ways to reach the remaining empty cells. Then we can take the harmonic mean of these numbers. I chose the harmonic mean because it is very sensitive to small numbers, something we want to avoid. So, the heuristic will take very low values if even a few cells start becoming inaccessible. Also, if even a single cell becomes completely inaccessible, the heuristic will take the value 0, which is also the worst possible score. Naturally, we aim to maximize this heuristic as we examine the various positions stemming from all the legitimate moves of each position. By repeating this process, we either end up with a full grid or one that doesn't progress because it's unsolvable. This simple problem may seem obvious now, but it is a good example of how a simple heuristic can solve a problem that's otherwise tough (at least for someone who hasn't tackled it enough to figure out a viable strategy). Naturally, we could bruteforce the whole thing, but it's doubtful that this approach would be scalable. After all, in the era of A.I. we are better off seeking intelligent solutions to problems, rather than just through computing resources at them! (image by Arek Socha, available at pixabay) Lately, I've been working on the final parts of my latest book, which is contracted for the end of Spring this year. As this is probably going to be my last technical book for the foreseeable future, I'd like to put my best into it, given the available resources of time and energy. This is one of the reasons I haven't been very active on this blog as of late. In this book (whose details I’m going to reveal when it’s in the printing press) I examine various aspects of data science in a quite handson way. One of these aspects, which I often talk about with my mentees, is that of scale. Scaling is very important in data science projects, particularly those involving distancebased metrics. Although the latter may be a bit niche from a modern standpoint where A.I. based systems are often the goto option, there is still a lot of value in distances as they are usually the prima materia of almost all similarity metrics. Similaritybased systems, aka transductive systems, are quite popular even in this era of A.I. based models. This is particularly the case in clustering problems, whereby both the clustering algorithms and the evaluation metrics (e.g. Silhouette score/width) are based on distances for evaluating cluster affinity. Also, certain dimensionality reduction methods like Principle Components Analysis (PCA) often require a certain kind of scaling to function optimally. Scaling is not as simple as it may first seem. After all, it greatly depends on the application as well as the data itself (something not everyone is aware of since the way scaling/normalization is treated in data science educational material is somewhat superficial). For example, you can have a fixed range scaling process or a fixed center one. You can even have a fixed range and fixed center one at the same time if you wish, though it's not something you'd normally see anywhere. Fixed scaling is usually in the [0, 1] interval and it involves scaling the data so that its range is constant. The center point of that data (usually measured with the arithmetic mean/average), however, could be distorted. How much so depends on the structure of the data. As for the fixed center scaling, this ensures that the center of the scaled variable is a given value, usually 0. In many cases, the spread of the scaled data is fixed too, usually by setting the standard deviation to 1. Programmatic methods for performing scaling vary, perhaps more than the Stats educators will have you think. For example, in the fixed range scaling, you could use the minmax normalization (aka 01 normalization, a term that shows both limited understanding of the topic and vagueness), or you could use a nonlinear function that is also bound by these values. The advantage of the latter is that you can mitigate the effect of any outliers, without having to eradicate them, all through the use of good oldfashioned Math! Naturally, most Stats educators shy away at the mention of the word nonlinear since they like to keep things simple (perhaps too simple) so don’t expect to learn about this kind of fixedrange scaling in a Stats book. All in all, scaling is something worth keeping in mind when dealing with data, particularly when using a distancebased method or a dimensionality reduction process like PCA. Naturally, there is more to the topic than meets the eye, plus as a process, it's not as basic as it may seem through the lens of package documentation or a Stats book. Whatever the case, it's something worth utilizing, always in tandem with other data engineering tools to ensure a better quality data science project. Hello everyone and happy new year! I hope you all had a good holiday break. I thought about it quite a bit and I've decided this year to go a different direction with the videos I make as I plan to focus more on courses. Stay tuned for more news on this matter in the weeks to come... Just wanted to wish you all Happy Holidays! It's been a great year and I appreciate your support through this blog. I won't be posting anything new in the next couple of weeks as I'll b traveling. Feel free to check out some of my older posts, though. I hope your holidays are insightful, inspirational, and intriguing! More important than remembering facts and methods related to data science problems is the trinity of inspiration, intuition, and imagination, with intelligence binding them all together. However, without inspiration, none of the stuff we know about data science is bound to grow much as our knowledge and knowhow gradually crystallize and start giving in to entropy. So, I'd like to take a moment and remind everyone (including myself) the value of inspiration, even in a fairly technical field such as data science (I don't mention A.I. here because A.I. is its own source of inspiration, especially when one considers the applications of it). So, what's your data science inspiration like? Where does it come from? What does it incentivize you towards? These are questions we need to ask ourselves from time to time, in order to make our learning of the field a sustainable process. The input of other data scientists is important in helping that but they may not always inspire us, especially after we grow out of the initial stages of our learning. This beginner’s mind although powerful is also fleeting and once it gives way to a more pragmatic view of data science, it is easy to lose our original enthusiasm for the field. That’s where inspiration comes in. For me, the source of inspiration in data science is twofold: first of all, it is my own research on the field, unbound by an academic agenda or a particular ideology (e.g. futurism). Such research is still disciplined but at the same time somewhat free, as in freedom (you can’t have research void of cost, unfortunately, even if that cost is just the time you dedicate to it). The other source of inspiration is mentoring, particularly students who are committed to learning data science through a structured and disciplined manner, such as the Thinkful courses on the subject. Naturally, I’d be happy to mentor other data science aspirants but so far this hasn’t taken place, for various reasons. Beyond these, the educational material I create as well as the conferences I participate in can be a great source of inspiration too. However, these are not things that happen frequently enough so as to consider them as primary sources of inspiration, no matter how impactful they can be at times. In practice, they often act as conduits of inspiration, to a certain extent, something that’s also valuable. After all, all these aspects of my data science presence are interconnected and feed off each other. What about you? What’s your inspiration for data science like? Does it come from a particular application, methodology, or educational material? How do you ensure that inspiration is part of your data science life? The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it it makes good sense. I've been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you. First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and nonnumeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it. This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself. Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level. Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationship of the variables, since it influences the densities of the data. However, a higherorder is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data. Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful. Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes. Nevertheless, this 5fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously. 
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I. Archives
May 2020
Categories
All
