Although it's fairly easy to compare two continuous variables and assess their similarity, it's not so straight-forward when you perform the same task on categorical variables. Of course, things are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think.
For example, if two variables are aligned (zeros to zeros and ones to ones), that’s fine. You can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are reversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity finds them dissimilar though there is no doubt that such a pair of variables may be relevant and the first one could be used to predict the second variable. Enter the Symmetric Jaccard Similarity (SJS), a metric that can alleviate this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of the two Jaccard similarities, one with the features as they are originally and one with one of them reversed.
SJS is easy to use and scalable, while its implementation in Julia is quite straight-forward. You just need to be comfortable with contingency tables, something that’s already an easy task in this language, though you can also code it from scratch without too much of a challenge. Anyway, SJS is fairly simple a metric, and something I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that’s not as simple as it may first seem.
Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it shuffles the first variables until the similarity of it with the second variable is maximized, something that’s done in a deterministic and scalable manner. However, it becomes apparent through the algorithm that SJS may fail to reveal the edge that a non-symmetric approach may yield, namely in the case where certain values of the first variable are more similar toward a particular value of the second variable. In a practical sense it means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes.
That's why an exhaustive search of all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is also a good one.
Of course, SJS for nominal features may be useful for assessing if one of them is redundant. Just like we apply some correlation metric for a group of continuous features, we can apply SJS for a group of nominal features, eliminating those that are unnecessary, before we start breaking them down into binary ones, something that can make the dataset explode in size in some cases.
All this is something I’ve been working on the other day, as part of another project. In my latest book “Julia for Machine Learning” (Technics Publications) I talk about such metrics (not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!
The concept of antifragility is well-established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g. in Investopedia). This is a vast concept and it’s unlikely that I can do it justice, especially in a blog post. That’s why I suggest you familiarize yourself with it first before reading the rest of this article.
Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones that manage to get every drop of signal from the data they are given), there are fragilities all over the place when it comes to how these models are used. A clear example of this is the computer code around them. I’m not referring to the code that’s used to implement them, usually coming from some specialized packages. That code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it the one taking care of ETL work, feature engineering, and even data visualization, may not always be good enough.
Antifragility applies to computer code in various ways. Here are the ones I’ve found so far:
All this may seem like a lot of work and it may not agree with your temporal restrictions, particularly if you have strict deadlines. However, you can always improve on your code after you’ve cleared a milestone. This way, you can avoid some Black Swans like an error being thrown while the program you’ve made is already in production. Cheers!
Last week I had to do a major operation for my computer. Namely, I had to replace the hard drive, as it was failing (regular warnings from the computer’s SMART diagnostics reminded me of the fact). Also, the fact that there wasn’t a single computer shop around that was a) open for business and b) willing to undertake such a task, didn’t help things. So, after waiting for about a month for a new hard disk to arrive by post, I took out my toolbox and started the operation of changing the SSD that my computer had. Naturally, I had backed up all my data beforehand and gotten a USB disk ready with an OS image installed so that I use after the new hard disk was up and running.
I won’t go into detail regarding the unbelievable challenges this process entailed (from stuck screws to a failing USB disk, to archive files that were apparently corrupt and unable to restore their content to the new hard disk). Instead, I’d like to focus on the gist of this whole experience, something that’s by far more relateable than the specifics of my situation. In essence, this whole situation was a “close to the metal” kind of experience, one that was both grounding and educational in a hands-on sense. Planning things is fairly easy but executing the plan and improvising alternative routes due to unforeseen (and possibly unforeseeable) circumstances is something we can all learn about. For example, at one point I had to find a different way to get the system running (an alternative USB disk), do a video call with a friend of mine (thanks Matt!) to troubleshoot the issue, and even come up with a contingency for backing up data in the future, so that it’s less prone to issues.
How does all this relate to data science? Well, in data science / AI projects we often have to deal with challenging situation that require us to get out of our comfort zone. We may even need to go to “closer to the metal” territory, e.g. the OS shell, for ETL tasks and such. Also, we may have to re-examine the architecture of the model used (e.g. the number of nodes for each layer, in the case of an ANN), the data used for the training of the model (do we really need all of the variables / data points?), and other factors that we often don’t think about.
Being closer to the metal is not something that concerns programmers only or computer technicians. It’s a state of mind that can come in very handy, even in high-level professions such as ours. Just like a good leader in a company has good relations with every echelon of his organization, even people he doesn’t interact with on a regular basis, a good data scientist ought to do the same. Detachment is useful in problem-solving but let’s not make it our default way of being. Sometimes we need to roll up our sleeves and handle tools we don’t usually use (e.g. the aforementioned screwdriver). With the right attitude, this can be a growing experience. Cheers!
If you don’t know what the word hyperthesis means don’t worry, it’s a term I came up myself. Stemming from the Greek “υπέρθεση” which means “hyperposition” or “superposition” depending on how you translate it, it is a term that describes transcendence of the binary state, but in a dynamic context (not to be confused with the quantum superposition which is somewhat different). In other words, it has to do with the controlled oscillation between extreme states until an equilibrium state is attained, at least at a reasonable robustness level that is predefined in the specs of the project at hand.
The Hyperthesis Principle is, therefore, a principle that describes the behavior of a complex system that is characterized by a hyperthetical behavior. Namely, if a system's state oscillates between two extreme states until it reaches an equilibrium of sorts, it exhibits hyperthetical behavior. If this behavior is a function of the parameters of the data this system relies on, then the system can, in theory, attain a stable evolutionary course that will result in equilibrium, namely a robust state.
“What does this have to do with data science, doc?” I can hear you say. Well, if you have been reading my blog, you may recall that predictive data models, especially the more sophisticated ones, are in essence complex systems. As such they may be in any state in the high bias – high variance spectrum. Now, we may tweak the parameters like a drunkard, hoping that we get them right, or we can do so through an understanding of the data and the model at hand. One way to accomplish the latter is through grid search, though this may not always be easy or affordable computationally. Imagine an SVM, for example, that is trained over a large dataset. It may take a while to find the optimum parameters for that model through a grid search, which is why we often revert to more stochastic approaches. This is where AI creeps in, even if we don't call it that. However, whenever a sophisticated optimization method is applied, the system exhibits a form of rudimentary intelligence. The more advanced the optimizer, the more it fits the bill, and calling it AI comes effortlessly.
Anyway, if we were to apply intelligence, artificial or otherwise, to a problem like that, we are in essence applying the hyperthesis principle. How well we do so, depends on how well we understand the problem we are trying to solve. However, being aware of this principle and applying it consciously can greatly facilitate the whole process. After all, all this is done through an iterative process, oftentimes involving several iterations of training and testing. Setting up the corresponding experiments can be aligned with the aforementioned principle, optimizing the whole process. So, instead of tweaking the model haphazardly, we make changes to it that make sense and navigate it towards a point in the parameter space that optimizes performance and robustness.
Understanding all this is the most important step in truly understanding AI and allowing this understanding to enhance our thinking. Also, it is at the core of the data science mindset. Cheers!
Hi everyone. Since these days I explore a different avenue for data science education, I've put together another webinar that's just 3 weeks away (May 18th). If you are interested in AI, be it as a data science professional or a stakeholder in data science projects, this is something that can add value to you. Also, you'll have a chance to ask me questions directly and if the time allows, even have a short discussion on this topic.
Note that due to the success of previous webinars in the Technics Publications platforms, the price of each webinar has risen. However, this upcoming webinar, which was originally designed as a talk for an international conference in Germany, is still at the very accessible price of $14.99. Feel free to check it out here and spread the word to friends or colleagues. You can also learn about the other webinars this platform offers through the corresponding web page. Cheers!
With more and more people getting into data science and AI these days, certain aspects of the field are inevitably over-emphasized while others are neglected. Naturally, those providing the corresponding know-how are not professional educators, even if they are competent practitioners and very knowledgeable people. As a result, a lot of emphasis is given to the technical aspects, such as math and programming related skills, data visualization, etc. What about domain knowledge though? Where does that fit in the whole picture?
Domain knowledge is all that knowledge that is specific to the domain data science or AI is applied on. If you are in the finance industry, it involves economics theory as well as how certain econometric indexes come into play. In the epidemiology sector, it involves some knowledge as to how viruses come about, how they propagate, and their effects on the organisms they exploit. Naturally, even if domain knowledge is specialized, it may play an important role in many cases. How much exactly depends on the problem at hand as well as how deep the data scientist or AI practitioner wants to go into the subject.
Domain knowledge may also include certain business-related aspects that also factor in data science work. Understanding the role of the different individuals who participate in a project is very important, especially if you are tackling a problem that is too complex for data professionals alone. Oftentimes, in projects like this, subject matter experts (SMEs) are utilized and as a data scientist or AI professional you need to liaise with them. This is not always easy as there is limited common ground that can be used as a frame of reference. That's where some general-purpose business knowledge comes in handy.
Naturally, incorporating domain knowledge in a data science project is a challenge in and of itself. Even if you do have this non-technical knowledge, you need to find ways to include it in the project organically, adding value to your analysis. That's why certain questions, particularly high-level questions that the stakeholders may want to be answered, are very important. Pairing these questions with other, more low-level questions that have to do with the data at hand, is crucial. Part of being a holistic, well-rounded data science / AI professional involves being able to accomplish this.
Of course, exploring this vast topic in a single or even multiple blog posts isn’t practical. Besides, how much can someone go into depth about this subject without getting difficult to read, especially if you are accessing this blog site via a mobile device? For this purpose, my co-author and I have gathered all the material we have accumulated on this topic and put it in a more refined form, namely a technical book. We are now at the final stages of this book, which is titled “Data Scientist Bedside Manner” and is published by Technics Publications. The book should be available before the end of the season. Stay tuned for more details...
In the previous post (not counting the webinars one, which was more of an announcement), I talked a bit about a new high-level model about scientific knowledge. However, I didn't talk much about its evolution, since that would make for a very long article (or even a book!). In this article, I'll look into some additional parts of this model and how it can help us understand the evolution of scientific knowledge. All this is closely tied to the data science mindset since, at its core, Data Science is applied science in real-world problems. So, in the previous article, we covered research, fidelity, and application as the key aspects of scientific knowledge and how the three of them are closely linked to a fourth one, the scope. But how do all these relate to the scientist and her work? Let's find out.
So, if you recall, the aforementioned factors can be visualized in the schematic we saw in the previous post.
But what lies in the middle of this? What’s at the heart of scientific knowledge? If you guessed the scientific method, you are right. After all, scientific knowledge doesn't grow on trees (with the exception of that apple tree upon which Newton was resting, perhaps). The scientific method is at the core of it since it binds research, fidelity, and even application to some extent. When an engineer (or even the scientist herself) explores a new theory and tests its validity, he makes use of the scientific method. Without it, he could still argue for or against the theory, but it would be more of a philosophical kind of treatise than anything else. Naturally, philosophy has value too, especially when it is a practical kind of philosophy, like that of the Stoics. However, in science, we are interested more in things that can be formulated with mathematical formulas and be tested rigorously through various data analytics tools, such as Statistics. This scientific method also constitutes the mindset of the scientist, something very important across different disciplines.
Now, if we were to explore this further, going beyond the plane of all the aforementioned aspects of scientific knowledge, we’d find (at least) two more aspects that are closely related to all this. Namely, we’ll find understanding and vision, both of which have to do with the scientist primarily. Understanding involves how deep we go into the ideas the scientific knowledge entails. It is not just rational though since it involves our intuition too. Understanding is like the roots of a tree, grounding scientific knowledge to something beyond the data and making the scientific theory we delve into something potentially imbued by enthusiasm. When you hear some scientists talk about their inventions, for example, you can almost feel that. No scientist would get passionate about math formulas but when it comes to the understanding of the scientific knowledge they have worked on, they can get quite passionate about it for sure!
On the other direction of this we have vision, which has to do with what we imagine about the scientific knowledge, be it its applicability, its extensions, and even the questions it may raise. The latter may bring about additional scientific projects, evolving the knowledge (and understanding) further. That's why it makes sense to visualize this as an upwards vector. Besides, we talk about understanding going deep, which is why we'd visualize it as a downwards vector. Naturally, we'd expect these to be correlated to some extent since deeper understanding would make for loftier visions regarding the scientific knowledge we explore. Also, these two aspects of scientific knowledge highlight the evolutionary aspect of it, rending it something highly dynamic and adaptive.
Hopefully, this article has shed some light on this intriguing topic. It may be a bit abstract but scientific knowledge is like this, at least until it manifests as technology. Feel free to share your thoughts on this topic through this blog. Cheers!
Scientific knowledge is a greatly misunderstood matter, especially today. As we are bombarded with scientific innovations regularly and see scientists getting featured in various media, or even have films made about them (e.g. the classic "A Beautiful Mind" and "The Theory of Everything"), we may conclude that science is easy or that anyone with enough determination and some intelligence can make it in the scientific world. However, science is anything but easy, while someone's mental prowess and willpower although they play an important role in all this, are not the best predictors of their success. Sometimes, people are just at the right place at the right time (as in the case of Einstein). In any case, to get a better understanding of all this, let's break it down to its fundamental components through a high-level model of sorts and see how they come into place to make scientific knowledge come about.
First of all, scientific knowledge is the knowledge that comes about through the scientific method, preexisting knowledge or information (e.g. through observations or raw data) and concerning a particular problem. The latter may be something concrete (e.g. a machine that can transform chemical energy into work) or abstract (e.g. a mathematical model that explains how two variables relate to each other and how one of them can act as a predictor for the other). A problem may also be a weakness of an existing theory or model, that requires further understanding before it can more widely useful.
Scientific knowledge has three primary aspects: research, fidelity, and application. Research has to do with the integration of information into a theory and/or new knowledge that supplements existing knowledge. This doesn't have to be groundbreaking since even a meta-analysis on a subject can provide crucial insight in understanding the problem at hand from a more holistic perspective. Perhaps that's why the first steps in a scientific project involve exploration and a critical analysis of the literature. In any case, it's hard to imagine scientific knowledge without research at its core since otherwise, it can become static, dogmatic and even superficial. The variety of different approaches to research and the value people place on it in all scientific institutions (including the non-academic ones) attests to that.
Fidelity has to do with developing confidence in something, particularly something new or different from what's already known. If the product of one's research doesn't carry confidence with it and it's just speculation or a thin interpretation of the data, it doesn't provide much usefulness. A scientist needs to attack the new knowledge with everything she's got to ensure that it holds water. That's why experimentation is so important as well as in some cases, peer review. The latter is very useful though not always essential since if the experiments are carried out properly and the scientist has no vested interest in the new knowledge (i.e. his intentions are pure), if there are issues with it they will surface sooner rather than later. If the new knowledge remains firm, the fidelity of it will grow along with the confidence of the scientist in what it can do. Naturally, this confidence level will never reach 100% since in science there is always room for disproving something.
This brings us to the next aspect: application. This involves the application of this new knowledge into a real-world problem or some other situation that's somewhat different from the original one. It has to do with making predictions using the new theory, predictions that are of some value to someone beyond the scientist. "Application is the ultimate and most sacred form of theory," a Greek philosopher once wrote. Even if he wasn't a scientist he must have been on to something. After all, most of today's new scientific output is geared towards applications of one form or another. That's not to say that a purely theoretical kind of research is not of any value. However, purely theoretical research is still knowledge in progress. Once this research finds its way to robustness (through fidelity) and a model or a physical system that applies itself to solve a real problem, then it will have completed its evolutionary journey.
Naturally, research, fidelity, and application are not isolated from each other. There is a great deal of interaction between them, partly because they are part of an organic whole and partly because one stems from the other. Without research, we cannot talk about fidelity, nor an application. Technology (which is linked to the latter) doesn’t come out of thin air. Also, no matter how much research we do, without testing the new knowledge to ensure a level of fidelity in it, we cannot use this knowledge elsewhere without contaminating the existing pool of knowledge. What’s more, if an application doesn’t work well enough we often need to go back to the research stage to refine the underlying knowledge. Finally, sometimes it is the application that drives both research and fidelity, giving everything an end goal and a quality standard. Otherwise, we could be researching for research’s sake without ever producing any new knowledge that can benefit others.
Because of all this, it makes sense to connect three data points to form a triangle. We can also draw the circle around this triangle as a way to picture another important aspect of scientific knowledge: its scope (see figure below). No scientific theory aims to explain everything, except perhaps some ambitious projects in Physics that aim to unify all existing theories regarding the universe (though none of them have been successful while their chances of success are a highly debatable matter). That’s where scope fits in to put all this into perspective.
In data science scope is beautifully explained in the "No Free Lunch Theorem" which goes on to say that if a model or algorithm has an edge over the alternatives for a particular kind of datasets, it means that it is bound to be weaker against these same alternatives for a different kind of dataset. In other words, no model outperforms all its alternatives always, just like there is no car out there that's better than all other cars for all sorts of terrain. Naturally, as the scientific knowledge grows for a particular domain, new knowledge may come about that has a larger scope (e.g. a theory that explains more of the observed phenomena and the corresponding data). Still, most new knowledge tends to have a very specific scope that although small in relation to the whole domain, it is still useful, as there is value in niche systems (think of a pickup truck that although a bit specialized, it addresses certain use cases very effectively and efficiently).
The aforementioned aspects of scientific knowledge are the basis of the high-level model proposed here as an effort to describe it. Naturally, this is just the beginning, as other factors come into play once we examine things from a larger time frame. This, however, is something that deserves its own article, so stay tuned...
What’s a Transductive Model?
A transductive model is a predictive analytics model that makes use of distances or similarities. Contrary to inference models that make use of induction and deduction to make their predictions, transductive models tend to be direct. Oftentimes, they don’t even have a training phase in the sense that the model “learns” as it performs its predictions on the testing set. Transductive models are generally under the machine learning umbrella and so far they have always been opaque (black boxes).
What’s Transparency in a Predictive Analytics Model?
Transparency is an umbrella term for anything that lends itself to a clear understanding of how it makes its predictions and/or how to interpret its results. Statistical models boast transparency since they are simple enough to understand and explain (but not simplistic). Transparency is valued greatly particularly when it comes to business decisions that use the outputs of a predictive model. For example, if you decide to let an employee go, you want to be able to explain why, be it to your manager, to your team, or the employee himself.
Transparent kNN sounds like an oxymoron, partly because the basic algorithm itself is a moron. It's very hard to think of a simpler and more basic algorithm in machine learning. This, however, hasn't stopped people from using it again and again due to the high speed it exhibits, particularly in smaller datasets. Still, kNN has been a black box so far, despite its many variants, some of which are ingenious indeed.
Lately, I've been experimenting with distances and on how they can be broken down into their fundamental components. As a result, I managed to develop a method for a distance metric that is transparent by design. By employing this same metric on the kNN model, and by applying some tweaks in various parts of it, the transparent version of kNN came about.
In particular, this transparent kNN model yields not only its predictions about the data at hand but also a confidence metric (akin to a probability score for each one of its predictions) and a weight matrix consisting of the weight each feature has in each one of its predictions. Naturally, as kNN is a model used in both classification and regression, all of the above are available in either one of its modalities. On top of that, the system can identify what modality to use based on the target variable of the training set.
For now, I’ll probably continue with other, more useful matters, such as feature fusion. After all, just like most machine learning models, kNN is at the mercy of the features it is given. If I were in academic research, I’d probably write a series of papers on this topic, but as I work solo on these endeavors, I need to prioritize. However, for anyone interested in learning more about this, I’m happy to reply to any queries through this blog. Cheers!
The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it it makes good sense. I've been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationship of the variables, since it influences the densities of the data. However, a higher-order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.