In the previous post (not counting the webinars one, which was more of an announcement), I talked a bit about a new high-level model of scientific knowledge. However, I didn't talk much about its evolution, since that would make for a very long article (or even a book!). In this article, I'll look into some additional parts of this model and how it can help us understand the evolution of scientific knowledge. All this is closely tied to the data science mindset since, at its core, Data Science is science applied to real-world problems.

In the previous article, we covered research, fidelity, and application as the key aspects of scientific knowledge and how the three of them are closely linked to a fourth one, the scope. But how do all these relate to the scientist and her work? Let's find out.

If you recall, the aforementioned factors can be visualized in the schematic we saw in the previous post. But what lies in the middle of this? What's at the heart of scientific knowledge? If you guessed the scientific method, you are right. After all, scientific knowledge doesn't grow on trees (with the exception of that apple tree upon which Newton was resting, perhaps). The scientific method is at the core of it since it binds research, fidelity, and even application to some extent. When an engineer (or even the scientist herself) explores a new theory and tests its validity, they make use of the scientific method. Without it, they could still argue for or against the theory, but it would be more of a philosophical kind of treatise than anything else. Naturally, philosophy has value too, especially when it is a practical kind of philosophy, like that of the Stoics. However, in science, we are more interested in things that can be expressed with mathematical formulas and tested rigorously through various data analytics tools, such as Statistics.
This scientific method also constitutes the mindset of the scientist, something very important across different disciplines. Now, if we were to explore this further, going beyond the plane of all the aforementioned aspects of scientific knowledge, we'd find (at least) two more aspects that are closely related to all this. Namely, we'll find understanding and vision, both of which have to do with the scientist primarily.

Understanding involves how deep we go into the ideas the scientific knowledge entails. It is not just rational, though, since it involves our intuition too. Understanding is like the roots of a tree, grounding scientific knowledge in something beyond the data and making the scientific theory we delve into something potentially imbued with enthusiasm. When you hear some scientists talk about their inventions, for example, you can almost feel that. No scientist would get passionate about math formulas alone, but when it comes to the understanding of the scientific knowledge they have worked on, they can get quite passionate for sure!

In the opposite direction we have vision, which has to do with what we imagine about the scientific knowledge, be it its applicability, its extensions, or even the questions it may raise. The latter may bring about additional scientific projects, evolving the knowledge (and understanding) further. That's why it makes sense to visualize vision as an upwards vector. Likewise, we talk about understanding going deep, which is why we'd visualize it as a downwards vector. Naturally, we'd expect these two to be correlated to some extent, since deeper understanding would make for loftier visions regarding the scientific knowledge we explore. Also, these two aspects highlight the evolutionary nature of scientific knowledge, rendering it something highly dynamic and adaptive.

Hopefully, this article has shed some light on this intriguing topic.
It may be a bit abstract but scientific knowledge is like this, at least until it manifests as technology. Feel free to share your thoughts on this topic through this blog. Cheers!
Webinars have been a valuable educational resource for years now, but only recently has the potential of this technology been valued so much. This is largely due to the Covid-19 situation that has made conventional conferences a no-no. Also, the low cost of webinars, coupled with the ecological advantage they have over their physical counterparts, makes webinars a great alternative.

At a time when video-based content is in abundance, it's easy to find something to watch and potentially educate yourself with. However, if you want quality content and value your time more than the easy accessibility of the stuff available for free, it's worth exploring the webinar option. Besides, nowadays the technology is more affordable than ever before, making it a high-ROI endeavor. As a bonus, you get to ask the presenter questions and do a bit of networking too.

How does all this fit with data science, though, and why is it part of this blog? Well, although webinars are good in general, they are particularly useful in data science as the latter is a hot topic. Because it's such a popular subject, data science has attracted all sorts of opportunists who brand themselves as data scientists just to make a quick buck. These people tend to create all sorts of content that is low-veracity information at best (and a scam at worst). Since discerning between what's legitimate content and what's just clickbait can sometimes be difficult (these con artists have become pretty good at what they do), it makes sense to pursue reputable sources for this video content.

One such source is the Technics Publications platform, which has recently started providing its own video content in the form of webinars. Although most of these webinars are on data modeling, a couple of them are on data science topics (ahem). Feel free to check them out!

Disclaimer: I have a direct monetary benefit in promoting these data science webinars.
However, I do so after ensuring I put a lot of work into preparing them, the same amount of work I'd put into preparing for a physical conference, like Customer Identity World and Data Modeling Zone. The only difference is the medium through which this content is delivered.

Scientific knowledge is a greatly misunderstood matter, especially today. As we are bombarded with scientific innovations regularly and see scientists getting featured in various media, or even have films made about them (e.g. the classic "A Beautiful Mind" and "The Theory of Everything"), we may conclude that science is easy or that anyone with enough determination and some intelligence can make it in the scientific world. However, science is anything but easy, and someone's mental prowess and willpower, although they play an important role in all this, are not the best predictors of their success. Sometimes, people are just at the right place at the right time (as in the case of Einstein). In any case, to get a better understanding of all this, let's break it down to its fundamental components through a high-level model of sorts and see how they come together to make scientific knowledge come about.

First of all, scientific knowledge is the knowledge that comes about through the scientific method and preexisting knowledge or information (e.g. observations or raw data), and that concerns a particular problem. The latter may be something concrete (e.g. a machine that can transform chemical energy into work) or abstract (e.g. a mathematical model that explains how two variables relate to each other and how one of them can act as a predictor for the other). A problem may also be a weakness of an existing theory or model that requires further understanding before it can be more widely useful.

Scientific knowledge has three primary aspects: research, fidelity, and application. Research has to do with the integration of information into a theory and/or new knowledge that supplements existing knowledge.
This doesn't have to be groundbreaking since even a meta-analysis on a subject can provide crucial insight in understanding the problem at hand from a more holistic perspective. Perhaps that's why the first steps in a scientific project involve exploration and a critical analysis of the literature. In any case, it's hard to imagine scientific knowledge without research at its core since otherwise it can become static, dogmatic, and even superficial. The variety of different approaches to research and the value people place on it in all scientific institutions (including the non-academic ones) attest to that.

Fidelity has to do with developing confidence in something, particularly something new or different from what's already known. If the product of one's research doesn't carry confidence with it and it's just speculation or a thin interpretation of the data, it isn't of much use. A scientist needs to attack the new knowledge with everything she's got to ensure that it holds water. That's why experimentation is so important, as is peer review in some cases. The latter is very useful though not always essential: if the experiments are carried out properly and the scientist has no vested interest in the new knowledge (i.e. her intentions are pure), any issues with it will surface sooner rather than later. If the new knowledge remains firm, its fidelity will grow along with the scientist's confidence in what it can do. Naturally, this confidence level will never reach 100% since in science there is always room for disproving something.

This brings us to the next aspect: application. This involves applying the new knowledge to a real-world problem or some other situation that's somewhat different from the original one. It has to do with making predictions using the new theory, predictions that are of some value to someone beyond the scientist.
"Application is the ultimate and most sacred form of theory," a Greek philosopher once wrote. Even if he wasn't a scientist he must have been on to something. After all, most of today's new scientific output is geared towards applications of one form or another. That's not to say that a purely theoretical kind of research is not of any value. However, purely theoretical research is still knowledge in progress. Once this research finds its way to robustness (through fidelity) and a model or a physical system that applies itself to solve a real problem, then it will have completed its evolutionary journey. Naturally, research, fidelity, and application are not isolated from each other. There is a great deal of interaction between them, partly because they are part of an organic whole and partly because one stems from the other. Without research, we cannot talk about fidelity, nor an application. Technology (which is linked to the latter) doesn’t come out of thin air. Also, no matter how much research we do, without testing the new knowledge to ensure a level of fidelity in it, we cannot use this knowledge elsewhere without contaminating the existing pool of knowledge. What’s more, if an application doesn’t work well enough we often need to go back to the research stage to refine the underlying knowledge. Finally, sometimes it is the application that drives both research and fidelity, giving everything an end goal and a quality standard. Otherwise, we could be researching for research’s sake without ever producing any new knowledge that can benefit others. Because of all this, it makes sense to connect three data points to form a triangle. We can also draw the circle around this triangle as a way to picture another important aspect of scientific knowledge: its scope (see figure below). 
No scientific theory aims to explain everything, except perhaps some ambitious projects in Physics that aim to unify all existing theories regarding the universe (though none of them has been successful and their chances of success are a highly debatable matter). That's where scope fits in to put all this into perspective. In data science, scope is beautifully explained by the "No Free Lunch Theorem", which says that if a model or algorithm has an edge over the alternatives for a particular kind of dataset, it is bound to be weaker against these same alternatives for a different kind of dataset. In other words, no model always outperforms all its alternatives, just like there is no car out there that's better than all other cars for all sorts of terrain.

Naturally, as the scientific knowledge grows for a particular domain, new knowledge may come about that has a larger scope (e.g. a theory that explains more of the observed phenomena and the corresponding data). Still, most new knowledge tends to have a very specific scope that, although small in relation to the whole domain, is still useful, as there is value in niche systems (think of a pickup truck that, although a bit specialized, addresses certain use cases very effectively and efficiently).

The aforementioned aspects of scientific knowledge are the basis of the high-level model proposed here as an effort to describe it. Naturally, this is just the beginning, as other factors come into play once we examine things over a larger time frame. This, however, is something that deserves its own article, so stay tuned...

Even if you are not a Bayesian Stats fan, it's not hard to appreciate this data analytics framework. In fact, it would be irresponsible to disregard it without delving into it, at least to some extent. Nevertheless, the fact is that Frequentist Stats (see image above), as well as Machine Learning, are more popular in data science.
Let's explore the reasons why this is.

Bayesian Stats relies primarily on the various versions of Bayes' Theorem. In a nutshell, this theorem states that if we have some idea of the a priori probabilities of an event A happening (as well as A not happening), as well as the likelihoods of event B happening given event A happening (as well as A not happening), we can estimate the probability of A given B. This is useful in a variety of cases, particularly when we don't have a great deal of data at our disposal. However, there is something often hard to gauge and it's the Achilles heel of Bayesian Stats. Namely, the a priori probabilities of A (aka the priors) are not always known, and when they are, they are usually rough estimates. Of course, this isn't a showstopper for a Bayesian Stats analysis, but it is a weak point that many people are not comfortable with since it introduces an element of subjectivity to the whole analysis. In Frequentist Stats, there are no priors and the whole framework has an objective approach to things. This may seem a bit far-fetched at times since lots of assumptions are often made, but at least most people are comfortable with these assumptions. In Machine Learning, the number of assumptions is significantly smaller as it's a data-driven approach to analytics, making things easier in many ways.

Another matter that makes Bayesian Stats less appealing for many people is the lack of proper education around this subject. Although it predates Frequentist Stats, Bayesian Stats never got enough traction in people's minds. The fact that Frequentist Stats was advocated by a very charismatic individual who was also a great data analyst (Ronald A. Fisher) may have contributed to that. Also, the people who embraced the different types of Statistics at the time augmented the frameworks with certain worldviews, making them more like ideological stances than anything else.
As a result, since most people who worked in data analytics at the time were more partial towards Fisher's worldview, it made more sense for them to advocate Frequentist Stats. The fact that Thomas Bayes was a man of the cloth may have dissuaded some people from supporting his Statistics framework.

Finally, Bayesian Stats involves a lot of advanced math when it is applied to continuous variables. As the latter scenario is quite common in most data analytics projects, Bayesian Stats ends up being a fairly esoteric discipline. It entails things like Monte Carlo simulations (which, although fairly straightforward, are not as simple as distribution plots and probability tables) and Markov Chains. Also, there are lots of lesser-known distributions used in Bayesian Stats (e.g. Poisson, Beta, and Gamma, just to name a few) that are not as simple or elegant as the Normal (Gaussian) distribution or the Student (t) distribution that are the bread and butter of Frequentist Stats. That's not to say that the latter is a walk in the park, but it's more accessible to a beginner in data analytics. As for Machine Learning, contrary to what many people think, it too is fairly accessible, especially if you use a reliable source with a price tag accompanying it, such as a course, a book, or even an educational video.

Summing up, Bayesian Statistics is a great tool that's worth exploring. If, however, you find that most data analytics professionals don't share your enthusiasm towards it, don't be dismayed. This is only natural, as the alternative frameworks maintain an advantage over Bayesian Stats.

Lately, there has been a lot of talk about the coronavirus disease (Covid-19), and Italy is allegedly a hotspot. As my partner lives in Italy and is constantly bombarded by warnings about potential infections and other alarming news like that, I figured it would be appropriate to do some back-of-the-envelope calculations about this situation and put things in perspective a bit.
After all, Bologna (the city where she lives) is not officially a "red zone" like Milan and a few other cities in the country. For this analysis, I used Bayes' Theorem (see formula below) along with some figures I managed to dig up regarding the virus in the greater Bologna area. The numbers may not be 100% accurate but they are the best I could find, while the assumption made was more than generous. Namely, I used the latest numbers regarding the spread of the disease as the priors, while for the likelihoods (the conditional probabilities regarding the test) I had to use two figures: one from the Journal of Radiology for the false positive rate (5 out of 167 or about 3%, in a particular study) and one for the true positive rate (aka sensitivity), which is the aforementioned assumption, namely 99%. In reality, this number is bound to be lower, but for the sake of argument, let's say that the test detects an infection 99% of the time. Note that the rate for certain Covid-19 tests using CT scans can be as low as 80%, while the test kits available in some countries perform even worse. For the priors, I used the data reported in the newspaper, namely around 40 cases for the greater Bologna area. The latter has a population of about 400 000 people (including the suburbs). So, given all that, what are the chances you actually have the virus if you do a test for it and the result comes back positive?
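As a rough numerical check on this question, here is a minimal Python sketch that plugs the figures quoted above into Bayes' theorem. The function and variable names are illustrative only. The 99% detection rate is the generous assumption mentioned earlier, and 5/167 is the false positive rate from the study cited, so the result is only as good as those inputs.

```python
def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """Return P(infection | positive) using Bayes' theorem."""
    # Total probability of a positive test: infected and detected,
    # plus healthy but falsely flagged.
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1.0 - prior))
    return true_positive_rate * prior / evidence


# Figures quoted in the text (rough estimates, not authoritative data).
prior = 40 / 400_000   # reported cases / population of greater Bologna
tpr = 0.99             # assumed P(positive | infection)
fpr = 5 / 167          # P(positive | healthy), about 3%

p = posterior_given_positive(prior, tpr, fpr)
print(f"P(infection | positive) = {p:.2%}")
```

With these inputs the posterior comes out at about 0.3%. Changing `prior` to the case count and population of another city is all it takes to repeat the estimate elsewhere.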
Well, by doing the math, Bayes' theorem can take the form:

$$P(\text{infection} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{infection})\, P(\text{infection})}{P(\text{positive} \mid \text{infection})\, P(\text{infection}) + P(\text{positive} \mid \text{healthy})\, P(\text{healthy})}$$

As being infected and being healthy are mutually exclusive, we can say that $P(\text{healthy}) = 1 - P(\text{infection})$. Doing some more math on this we end up with this slightly more elegant formula:

$$P(\text{infection} \mid \text{positive}) = \frac{1}{1 + \lambda \left( \frac{1}{P(\text{infection})} - 1 \right)}$$

where $\lambda = P(\text{positive} \mid \text{healthy}) / P(\text{positive} \mid \text{infection})$. Plugging in all the numbers we end up with:

$$P(\text{infection} \mid \text{positive}) = \frac{1}{1 + 303} \approx 0.3\% \;(!)$$

In other words, even if you do a proper test for Covid-19, and the test is positive (i.e. the doctor tells you "you're infected"), the chances of this being true are about 1 in 300. This is roughly equivalent to rolling a triple 1 using 3 dice (i.e. you roll three dice and the outcome is 111, a 1 in 216 chance). Of course, if you don't test positive, the chances of you having the virus are much lower.

Note that the above analysis is for the city of Bologna and that for other cities you'll need to update the formula with the numbers that apply there. However, even if the scope of this analysis is limited to the greater Bologna area, it goes to show that this whole situation that plagues Italy is more fearmongering than anything else. Nevertheless, it is advisable to be mindful of your health, as during times of changing weather (and climate) your immune system may need some help to keep your body healthy, so anything you do to help it is a plus. Things like exercise, a good diet, exposure to the sun, keeping stress at bay, and maintaining good body hygiene are essential regardless of what pathogens may or may not threaten your wellbeing. Stay healthy!

What's a Transductive Model?
A transductive model is a predictive analytics model that makes use of distances or similarities. Contrary to inference models that make use of induction and deduction to make their predictions, transductive models tend to be direct. Oftentimes, they don't even have a training phase, in the sense that the model "learns" as it performs its predictions on the testing set. Transductive models are generally under the machine learning umbrella and so far they have always been opaque (black boxes).

What's Transparency in a Predictive Analytics Model?

Transparency is an umbrella term for anything that lends itself to a clear understanding of how it makes its predictions and/or how to interpret its results. Statistical models boast transparency since they are simple enough to understand and explain (but not simplistic). Transparency is valued greatly, particularly when it comes to business decisions that use the outputs of a predictive model. For example, if you decide to let an employee go, you want to be able to explain why, be it to your manager, to your team, or to the employee himself.

Transparent kNN?

Transparent kNN sounds like an oxymoron, partly because the basic algorithm itself is a moron. It's very hard to think of a simpler and more basic algorithm in machine learning. This, however, hasn't stopped people from using it again and again due to the high speed it exhibits, particularly in smaller datasets. Still, kNN has been a black box so far, despite its many variants, some of which are ingenious indeed.

Lately, I've been experimenting with distances and with how they can be broken down into their fundamental components. As a result, I managed to develop a method for a distance metric that is transparent by design. By employing this same metric in the kNN model, and by applying some tweaks in various parts of it, the transparent version of kNN came about.
In particular, this transparent kNN model yields not only its predictions about the data at hand but also a confidence metric (akin to a probability score for each one of its predictions) and a weight matrix consisting of the weight each feature has in each one of its predictions. Naturally, as kNN is a model used in both classification and regression, all of the above are available in either one of its modalities. On top of that, the system can identify what modality to use based on the target variable of the training set.

What's Next?

For now, I'll probably continue with other, more useful matters, such as feature fusion. After all, just like most machine learning models, kNN is at the mercy of the features it is given. If I were in academic research, I'd probably write a series of papers on this topic, but as I work solo on these endeavors, I need to prioritize. However, for anyone interested in learning more about this, I'm happy to reply to any queries through this blog. Cheers!

For over two decades there has been a puzzle game I've played from time to time, usually to pass the time creatively or to challenge myself in algorithm development. This game, which I was taught by a friend, didn't have a name and I never managed to find it elsewhere, so I call it Numgame (as it involves numbers and it's a game). Over the years, I managed to solve many of its levels though I never got an algorithm for it, until now.

The game involves a square grid, originally a 10-by-10 one. The simplest grid that's solvable is the 5-by-5 one. The object of the game is to fill the grid with numbers, starting from 1 and going all the way to n^2, where n is the size of the grid, which can be any number larger than 4 (grids of size 4 or lower are not solvable). To fill the grid, you can "move" horizontally, vertically, and diagonally, as long as the cell you go to is empty. When moving horizontally or vertically you need to skip 2 squares, while when you move diagonally you need to skip 1.
Naturally, as you progress, getting to the remaining empty squares becomes increasingly hard. That's why you need to have a strategy if you are to finish the game successfully. Naturally, not all starting positions yield a successful result. Although more often than not you'd start from a corner, you may choose to start from any other square in the grid. That's useful, considering that some grids are just not solvable if you start from a corner (see image below; empty cells are marked as zeros).

Before we look at the solution I've come across, try to solve a grid on your own and think about a potential algorithm to solve any grid. At the very least, you'll gain an appreciation of the solution afterward.

Anyway, the key to solving the Numgame levels is to use a heuristic that will help you assess each move. In other words, you'll need to figure out a score that discerns between good and bad positions, i.e. the positions that result from the various moves. So, for each cell in the grid, you can count how many legitimate ways there are of accessing it (i.e. ways complying with the aforementioned rules). You can store these numbers in a matrix. Then, you can filter out the cells that have been occupied already, since we won't be revisiting them anyway. This leaves us with a list of numbers corresponding to the number of ways to reach the remaining empty cells. Then we can take the harmonic mean of these numbers. I chose the harmonic mean because it is very sensitive to small numbers, and cells with few ways in are exactly what we want to avoid. So, the heuristic will take very low values if even a few cells start becoming inaccessible. Also, if even a single cell becomes completely inaccessible, the heuristic will take the value 0, which is also the worst possible score. Naturally, we aim to maximize this heuristic as we examine the various positions stemming from all the legitimate moves of each position.
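The heuristic described above can be sketched in Python as follows. This is one possible reading of the description, not the exact implementation behind the post. It assumes that skipping 2 squares means landing 3 cells away in a straight line, that skipping 1 diagonally means landing 2 cells away on both axes, and that an empty cell counts as accessible from any other empty cell or from the current position. The names `MOVES`, `neighbours`, `heuristic`, and `solve_greedy` are hypothetical.

```python
# Jumps assumed from the rules: 3 cells straight, 2 cells on each axis diagonally.
MOVES = [(3, 0), (-3, 0), (0, 3), (0, -3),
         (2, 2), (2, -2), (-2, 2), (-2, -2)]


def neighbours(n, cell):
    """Cells on an n-by-n grid that are one legal jump away from `cell`."""
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in MOVES
            if 0 <= r + dr < n and 0 <= c + dc < n]


def heuristic(grid, head):
    """Harmonic mean of the number of ways into each empty cell.

    Returns 0.0 as soon as some empty cell cannot be reached at all,
    and infinity when the grid is already full.
    """
    n = len(grid)
    counts = []
    for r in range(n):
        for c in range(n):
            if grid[r][c] != 0:
                continue  # occupied cells are never revisited
            ways = sum(1 for (rr, cc) in neighbours(n, (r, c))
                       if grid[rr][cc] == 0 or (rr, cc) == head)
            if ways == 0:
                return 0.0
            counts.append(ways)
    if not counts:
        return float("inf")
    return len(counts) / sum(1.0 / w for w in counts)


def solve_greedy(n, start):
    """Fill the grid greedily, always taking the best-scoring move.

    Returns the filled grid, or None if the greedy walk gets stuck.
    """
    grid = [[0] * n for _ in range(n)]
    head = start
    grid[head[0]][head[1]] = 1
    for k in range(2, n * n + 1):
        best_score, best_cell = -1.0, None
        for r, c in neighbours(n, head):
            if grid[r][c] != 0:
                continue
            grid[r][c] = k                    # try the move
            score = heuristic(grid, (r, c))
            grid[r][c] = 0                    # undo it
            if score > best_score:
                best_score, best_cell = score, (r, c)
        if best_cell is None or best_score == 0.0:
            return None                       # dead end
        grid[best_cell[0]][best_cell[1]] = k
        head = best_cell
    return grid


if __name__ == "__main__":
    result = solve_greedy(5, (0, 0))
    if result is None:
        print("Greedy walk got stuck from this starting cell.")
    else:
        for row in result:
            print(" ".join(f"{v:2d}" for v in row))
```

The loop never backtracks, so a `None` result only means this greedy walk failed from that starting cell, not that the grid is unsolvable. Trying other starting cells, or adding backtracking on top of the same score, are natural extensions.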
By repeating this process, we either end up with a full grid or one that doesn't progress because it's unsolvable.

This simple problem may seem obvious now, but it is a good example of how a simple heuristic can solve a problem that's otherwise tough (at least for someone who hasn't tackled it enough to figure out a viable strategy). Naturally, we could brute-force the whole thing, but it's doubtful that this approach would be scalable. After all, in the era of A.I. we are better off seeking intelligent solutions to problems, rather than just throwing computing resources at them!

(image by Arek Socha, available at pixabay)

Lately, I've been working on the final parts of my latest book, which is contracted for the end of Spring this year. As this is probably going to be my last technical book for the foreseeable future, I'd like to put my best into it, given the available resources of time and energy. This is one of the reasons I haven't been very active on this blog as of late.

In this book (whose details I'm going to reveal when it's in the printing press) I examine various aspects of data science in quite a hands-on way. One of these aspects, which I often talk about with my mentees, is that of scale. Scaling is very important in data science projects, particularly those involving distance-based metrics. Although the latter may be a bit niche from a modern standpoint where A.I.-based systems are often the go-to option, there is still a lot of value in distances as they are usually the prima materia of almost all similarity metrics. Similarity-based systems, aka transductive systems, are quite popular even in this era of A.I.-based models. This is particularly the case in clustering problems, where both the clustering algorithms and the evaluation metrics (e.g. Silhouette score/width) are based on distances for evaluating cluster affinity.
Also, certain dimensionality reduction methods like Principal Component Analysis (PCA) often require a certain kind of scaling to function optimally.

Scaling is not as simple as it may first seem. After all, it greatly depends on the application as well as the data itself (something not everyone is aware of since the way scaling/normalization is treated in data science educational material is somewhat superficial). For example, you can have a fixed range scaling process or a fixed center one. You can even have a fixed range and fixed center one at the same time if you wish, though it's not something you'd normally see anywhere.

Fixed range scaling usually targets the [0, 1] interval and involves scaling the data so that its range is constant. The center point of that data (usually measured with the arithmetic mean/average), however, could be distorted. How much so depends on the structure of the data. As for fixed center scaling, this ensures that the center of the scaled variable is a given value, usually 0. In many cases, the spread of the scaled data is fixed too, usually by setting the standard deviation to 1.

Programmatic methods for performing scaling vary, perhaps more than the Stats educators would have you think. For example, in fixed range scaling, you could use the min-max normalization (aka 0-1 normalization, a term that shows both limited understanding of the topic and vagueness), or you could use a non-linear function that is also bounded by these values. The advantage of the latter is that you can mitigate the effect of any outliers without having to eradicate them, all through the use of good old-fashioned Math! Naturally, most Stats educators shy away at the mention of the word non-linear since they like to keep things simple (perhaps too simple), so don't expect to learn about this kind of fixed-range scaling in a Stats book.
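To make the distinction concrete, here is a small Python sketch of the three kinds of scaling mentioned above. The min-max and standard-score functions are the usual textbook versions. The non-linear one is only an illustration: the post does not say which function is meant, so a logistic curve applied to a median/MAD-standardized variable is assumed here, and all function names are hypothetical.

```python
import math
import statistics


def minmax_scale(xs):
    """Fixed range: map the data linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def zscore_scale(xs):
    """Fixed center and spread: mean 0, standard deviation 1."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]


def sigmoid_scale(xs):
    """Non-linear fixed range: logistic curve of a robustly centered variable.

    Assumption for illustration: center on the median and divide by the
    median absolute deviation (MAD), which must be non-zero here.
    """
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    # 0.5 * (1 + tanh(z / 2)) equals 1 / (1 + exp(-z)) but cannot overflow.
    return [0.5 * (1.0 + math.tanh((x - med) / mad / 2.0)) for x in xs]


data = [1.0, 2.0, 3.0, 4.0, 100.0]   # four ordinary points and one outlier

print("min-max:", [round(v, 3) for v in minmax_scale(data)])
print("z-score:", [round(v, 3) for v in zscore_scale(data)])
print("sigmoid:", [round(v, 3) for v in sigmoid_scale(data)])
```

On this toy sample, min-max squeezes the four ordinary points into a tiny slice near 0 because the outlier defines the range, whereas the logistic version keeps them well spread and simply saturates the outlier near 1.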
All in all, scaling is something worth keeping in mind when dealing with data, particularly when using a distance-based method or a dimensionality reduction process like PCA. Naturally, there is more to the topic than meets the eye, plus, as a process, it's not as basic as it may seem through the lens of package documentation or a Stats book. Whatever the case, it's something worth utilizing, always in tandem with other data engineering tools to ensure a better quality data science project.

Hello everyone and happy new year! I hope you all had a good holiday break. I thought about it quite a bit and I've decided to go in a different direction this year with the videos I make, as I plan to focus more on courses. Stay tuned for more news on this matter in the weeks to come...

Just wanted to wish you all Happy Holidays! It's been a great year and I appreciate your support through this blog. I won't be posting anything new in the next couple of weeks as I'll be traveling. Feel free to check out some of my older posts, though. I hope your holidays are insightful, inspirational, and intriguing!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.
