Although it's fairly easy to compare two continuous variables and assess their similarity, things are not so straightforward when you perform the same task on categorical variables. Of course, matters are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think.
For example, if two variables are aligned (zeros to zeros and ones to ones), that's fine: you can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are inversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity deems them dissimilar, even though such a pair of variables is clearly related and the first one could be used to predict the second. Enter the Symmetric Jaccard Similarity (SJS), a metric that alleviates this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of two Jaccard similarities: one with the features as they originally are, and one with one of them reversed.
SJS is easy to use and scalable, while its implementation in Julia is quite straightforward. You just need to be comfortable with contingency tables, something that's already an easy task in this language, though you can also code it from scratch without much of a challenge. Anyway, SJS is a fairly simple metric, and one I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that's not as simple as it may first seem.
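To make this concrete, here is a minimal sketch of the binary case in Julia, based on the definition above. The function names are mine, for illustration; no packages are needed.

```julia
# Jaccard similarity of two binary vectors: |A ∩ B| / |A ∪ B|,
# where "membership" means having the value 1 (true)
function jaccard(x::AbstractVector{Bool}, y::AbstractVector{Bool})
    both = sum(x .& y)      # cases where both variables are 1
    either = sum(x .| y)    # cases where at least one variable is 1
    return either == 0 ? 0.0 : both / either
end

# SJS: the maximum of the plain Jaccard similarity and the one
# computed with the first variable reversed
sjs(x, y) = max(jaccard(x, y), jaccard(.!x, y))

x = Bool[1, 0, 1, 1, 0]
y = Bool[0, 1, 0, 0, 1]   # the exact inverse pattern of x
jaccard(x, y)             # 0.0, plain Jaccard misses the relationship
sjs(x, y)                 # 1.0, SJS captures it
```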
Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it shuffles the values of the first variable until its similarity to the second variable is maximized, something that's done in a deterministic and scalable manner. However, it becomes apparent through the algorithm that SJS may fail to reveal the edge that a non-symmetric approach could yield, namely in the case where certain values of the first variable are tied more closely to a particular value of the second variable. In practical terms, this means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes.
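Since spelling out the full algorithm would take a post of its own, here is a rough greedy sketch of the idea in Julia; treat it as an illustrative simplification rather than the actual implementation.

```julia
# A greedy take on nominal SJS: remap each value of x to the value of y
# it co-occurs with most often (via a contingency table of counts), then
# measure the agreement. An illustrative simplification, not the full algorithm.
function nominal_sjs(x::AbstractVector, y::AbstractVector)
    counts = Dict{Tuple{Any,Any},Int}()   # contingency table as a dictionary
    for (xi, yi) in zip(x, y)
        counts[(xi, yi)] = get(counts, (xi, yi), 0) + 1
    end
    best = Dict{Any,Any}()   # most frequent partner of each value of x
    for ((xi, yi), c) in counts
        if !haskey(best, xi) || c > counts[(xi, best[xi])]
            best[xi] = yi
        end
    end
    return sum(best[xi] == yi for (xi, yi) in zip(x, y)) / length(x)
end

x = ["a", "a", "b", "b", "c", "c"]
y = ["p", "p", "q", "q", "p", "q"]
nominal_sjs(x, y)   # 5/6: the value "c" splits between "p" and "q"
```

Notice how "c" is the culprit in this toy example: "a" and "b" predict their classes perfectly, while "c" doesn't favor any class. This is exactly the kind of edge an aggregate similarity score can gloss over.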
That's why an exhaustive search of all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is a sound one.
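As a rough illustration of that strategy, the sketch below breaks a nominal feature into binary indicator variables and scores each one against a binary target, reusing the sjs function from the earlier sketch (again, the names and data are made up).

```julia
# Break a nominal feature into one binary indicator per distinct value
# and score each indicator against a binary target using SJS
function binary_breakdown(x::AbstractVector, y::AbstractVector{Bool})
    scores = Dict{Any,Float64}()
    for v in unique(x)
        indicator = x .== v   # one binary feature per distinct value
        scores[v] = sjs(indicator, y)
    end
    return scores
end

x = ["a", "a", "b", "b", "c", "c"]
y = Bool[1, 1, 0, 0, 0, 0]
binary_breakdown(x, y)   # "a" scores 1.0: it flags the positive class perfectly
```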
Of course, SJS for nominal features may be useful for assessing whether one of them is redundant. Just like we apply some correlation metric to a group of continuous features, we can apply SJS to a group of nominal features, eliminating those that are unnecessary before we start breaking them down into binary ones, something that can make the dataset explode in size in some cases.
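A simple screening of this sort could look like the sketch below, which reuses nominal_sjs from before; the function name and the 0.95 cutoff are placeholders you'd adapt to your dataset.

```julia
# Flag nominal features as redundant when their pairwise similarity
# exceeds a cutoff. A hypothetical helper, for illustration only.
function redundant_features(features::Dict{String,<:AbstractVector}; cutoff = 0.95)
    names = collect(keys(features))
    flagged = String[]
    for i in 1:length(names), j in (i+1):length(names)
        a, b = features[names[i]], features[names[j]]
        # take the larger of the two directions, as the greedy version isn't symmetric
        if max(nominal_sjs(a, b), nominal_sjs(b, a)) >= cutoff
            push!(flagged, names[j])   # keep the first feature, flag the second
        end
    end
    return unique(flagged)
end
```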
All this is something I worked on the other day, as part of another project. In my latest book, “Julia for Machine Learning” (Technics Publications), I talk about such metrics (not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!
This book, which is probably my last solo project, is one I wrote after discussing it with a technical publisher last summer. Although he was quite keen on the idea and was willing to offer me a good deal, I decided to stick with my current publisher (Technics Publications) for various reasons. So, a few months afterwards, this book came along. It wasn't an easy journey, since at the same time I had to finalize the Data Scientist Bedside Manner book, which I co-authored with Yunus Bulut. However, with enough patience and perseverance, the new book was completed shortly after the Bedside Manner one was finalized.
Julia for Machine Learning is a fresh take on Julia's data science potential, with a focus on machine learning models. Although I don't cover AI-based models in it, I make references to Julia packages you can use for them. The book is accompanied by a few Jupyter notebooks, as well as three .jl script files containing heuristics never before seen in this language (a couple of them are brand-new ones I developed in the past year, which are available only in Julia).
Although the book is not yet available on the publisher's website, you can find it on Amazon, in both paperback and Kindle formats. Happy Julia programming!
The concept of antifragility was established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g. on Investopedia). It is a vast concept, and it's unlikely that I can do it justice, especially in a blog post. That's why I suggest you familiarize yourself with it first before reading the rest of this article.
Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones, which manage to get every drop of signal from the data they are given), there are fragilities all over the place when it comes to how these models are used. A clear example of this is the computer code around them. I'm not referring to the code that's used to implement them, which usually comes from specialized packages; that code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it for ETL work, feature engineering, or even data visualization, may not always be good enough.
Antifragility applies to computer code in various ways. Here are the ones I’ve found so far:
All this may seem like a lot of work, and it may not agree with your time constraints, particularly if you have strict deadlines. However, you can always improve your code after you've cleared a milestone. This way, you can avoid some Black Swans, like an error being thrown while the program you've made is already in production. Cheers!
Last week I had to perform a major operation on my computer. Namely, I had to replace the hard drive, as it was failing (regular warnings from the computer's SMART diagnostics reminded me of the fact). The fact that there wasn't a single computer shop around that was a) open for business and b) willing to undertake such a task didn't help things either. So, after waiting about a month for a new hard disk to arrive by post, I took out my toolbox and started the operation of replacing my computer's SSD. Naturally, I had backed up all my data beforehand and gotten a USB disk ready with an OS image installed, so that I could use it once the new hard disk was in place.
I won't go into detail regarding the unbelievable challenges this process entailed (from stuck screws, to a failing USB disk, to archive files that were apparently corrupt and couldn't restore their content to the new hard disk). Instead, I'd like to focus on the gist of this whole experience, something that's far more relatable than the specifics of my situation. In essence, this whole situation was a "close to the metal" kind of experience, one that was both grounding and educational in a hands-on sense. Planning things is fairly easy, but executing the plan and improvising alternative routes due to unforeseen (and possibly unforeseeable) circumstances is something we can all learn from. For example, at one point I had to find a different way to get the system running (an alternative USB disk), do a video call with a friend of mine (thanks, Matt!) to troubleshoot the issue, and even come up with a contingency plan for backing up data in the future, so that it's less prone to issues.
How does all this relate to data science? Well, in data science / AI projects we often have to deal with challenging situations that require us to get out of our comfort zone. We may even need to venture into "closer to the metal" territory, e.g. the OS shell, for ETL tasks and the like. Also, we may have to re-examine the architecture of the model used (e.g. the number of nodes in each layer, in the case of an ANN), the data used for training the model (do we really need all of those variables / data points?), and other factors that we often don't think about.
Being closer to the metal is not something that concerns only programmers and computer technicians. It's a state of mind that can come in very handy, even in high-level professions such as ours. Just like a good leader in a company maintains good relations with every echelon of the organization, even people he doesn't interact with on a regular basis, a good data scientist ought to do the same. Detachment is useful in problem-solving, but let's not make it our default way of being. Sometimes we need to roll up our sleeves and handle tools we don't usually use (e.g. the aforementioned screwdriver). With the right attitude, this can be a growth experience. Cheers!
If you don't know what the word hyperthesis means, don't worry; it's a term I came up with myself. Stemming from the Greek “υπέρθεση,” which means “hyperposition” or “superposition” depending on how you translate it, it describes transcendence of the binary state, but in a dynamic context (not to be confused with quantum superposition, which is somewhat different). In other words, it has to do with the controlled oscillation between extreme states until an equilibrium state is attained, at least at a reasonable robustness level that is predefined in the specs of the project at hand.
The Hyperthesis Principle is, therefore, a principle that describes the behavior of a complex system characterized by such hyperthetical behavior. Namely, if a system's state oscillates between two extreme states until it reaches an equilibrium of sorts, it exhibits hyperthetical behavior. If this behavior is a function of the parameters of the data the system relies on, then the system can, in theory, attain a stable evolutionary course that results in equilibrium, namely a robust state.
“What does this have to do with data science, doc?” I can hear you say. Well, if you have been reading my blog, you may recall that predictive data models, especially the more sophisticated ones, are in essence complex systems. As such, they may be anywhere on the high bias – high variance spectrum. Now, we may tweak the parameters like a drunkard, hoping that we get them right, or we can do so through an understanding of the data and the model at hand. One way to accomplish the latter is through grid search, though this may not always be easy or computationally affordable. Imagine an SVM, for example, that is trained on a large dataset. It may take a while to find the optimal parameters for that model through a grid search, which is why we often resort to more stochastic approaches. This is where AI creeps in, even if we don't call it that. Whenever a sophisticated optimization method is applied, the system exhibits a form of rudimentary intelligence. The more advanced the optimizer, the more it fits the bill, and calling it AI comes effortlessly.
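For reference, a bare-bones grid search looks something like the sketch below; the evaluate function is a mock stand-in for whatever training-and-scoring routine you'd actually use (e.g. cross-validated accuracy for that SVM), and the parameter names and grids are placeholders.

```julia
# Mock scoring function: in practice this would train the model with the
# given hyperparameters and return, say, a cross-validated accuracy.
# The names C and gamma mirror an SVM's parameters but are placeholders.
evaluate(C, gamma) = -((log10(C) - 1)^2 + (log10(gamma) + 2)^2)

function grid_search()
    best_score = -Inf
    best_params = (C = NaN, gamma = NaN)
    for C in 10.0 .^ (-2:3), gamma in 10.0 .^ (-4:1)   # logarithmic grids
        score = evaluate(C, gamma)
        if score > best_score
            best_score, best_params = score, (C = C, gamma = gamma)
        end
    end
    return best_params, best_score
end

grid_search()   # finds (C = 10.0, gamma = 0.01) for this mock score surface
```

The cost is obvious: the number of evaluations grows multiplicatively with each parameter added, which is exactly why stochastic approaches become attractive for models that are expensive to train.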
Anyway, if we were to apply intelligence, artificial or otherwise, to a problem like that, we would in essence be applying the hyperthesis principle. How well we do so depends on how well we understand the problem we are trying to solve. However, being aware of this principle and applying it consciously can greatly facilitate the whole process. After all, all this is done through an iterative process, oftentimes involving several rounds of training and testing. Setting up the corresponding experiments can be aligned with the aforementioned principle, optimizing the whole process. So, instead of tweaking the model haphazardly, we make changes to it that make sense, navigating it towards a point in the parameter space that optimizes performance and robustness.
Grasping all this is the most important step in truly understanding AI and allowing that understanding to enhance our thinking. Also, it is at the core of the data science mindset. Cheers!
Hi everyone. Since these days I'm exploring a different avenue for data science education, I've put together another webinar, which is just 3 weeks away (May 18th). If you are interested in AI, be it as a data science professional or as a stakeholder in data science projects, this is something that can add value for you. Also, you'll have a chance to ask me questions directly and, if time allows, even have a short discussion on this topic.
Note that due to the success of previous webinars on the Technics Publications platform, the price of each webinar has risen. However, this upcoming webinar, which was originally designed as a talk for an international conference in Germany, is still at the very accessible price of $14.99. Feel free to check it out here and spread the word to friends or colleagues. You can also learn about the other webinars this platform offers through the corresponding web page. Cheers!
These days I haven't had a chance to prepare an article for my blog. Between helping out a friend of mine and preparing for my webinar this Thursday, I didn't have the headspace to write anything. Nevertheless, one of the articles I wrote for my friend's initiative, related to mentoring, is now available on Medium. Feel free to check it out!
As for the webinar, it's about the data science mindset, a topic I've talked about in all of my books, particularly the Data Science Mindset, Methodologies, and Misconceptions one. At the time of this writing, there are still some spots available, so if you are interested, feel free to register for it here.
On another note, my latest book is almost ready for the review stage so I'll be working on that come Friday. Stay tuned for more details in the weeks to come...
That's all for now. I hope you have a great week. Stay healthy and positive!
With more and more people getting into data science and AI these days, certain aspects of the field are inevitably over-emphasized while others are neglected. Naturally, those providing the corresponding know-how are not professional educators, even if they are competent practitioners and very knowledgeable people. As a result, a lot of emphasis is placed on the technical aspects, such as math- and programming-related skills, data visualization, etc. What about domain knowledge, though? Where does that fit into the whole picture?
Domain knowledge is all the knowledge that is specific to the domain data science or AI is applied to. If you are in the finance industry, it involves economic theory, as well as how certain econometric indexes come into play. In the epidemiology sector, it involves some knowledge of how viruses come about, how they propagate, and their effects on the organisms they exploit. Even though domain knowledge is specialized, it may play an important role in many cases. How much exactly depends on the problem at hand, as well as how deep the data scientist or AI practitioner wants to go into the subject.
Domain knowledge may also include certain business-related aspects that factor into data science work. Understanding the role of the different individuals who participate in a project is very important, especially if you are tackling a problem that is too complex for data professionals alone. Oftentimes, in projects like this, subject matter experts (SMEs) are brought in, and as a data scientist or AI professional you need to liaise with them. This is not always easy, as there is limited common ground to use as a frame of reference. That's where some general-purpose business knowledge comes in handy.
Naturally, incorporating domain knowledge in a data science project is a challenge in and of itself. Even if you do have this non-technical knowledge, you need to find ways to include it in the project organically, adding value to your analysis. That's why certain questions, particularly the high-level questions the stakeholders may want answered, are very important. Pairing these questions with other, more low-level questions that have to do with the data at hand is crucial. Part of being a holistic, well-rounded data science / AI professional is being able to accomplish this.
Of course, exploring this vast topic in a single blog post, or even several, isn't practical. Besides, how deep can someone go into this subject without the text becoming difficult to read, especially if you are accessing this blog via a mobile device? For this reason, my co-author and I have gathered all the material we have accumulated on this topic and put it in a more refined form, namely a technical book. We are now at the final stages of this book, which is titled “Data Scientist Bedside Manner” and is published by Technics Publications. The book should be available before the end of the season. Stay tuned for more details...
In the previous post (not counting the one about webinars, which was more of an announcement), I talked a bit about a new high-level model of scientific knowledge. However, I didn't talk much about its evolution, since that would make for a very long article (or even a book!). In this article, I'll look into some additional parts of this model and how it can help us understand the evolution of scientific knowledge. All this is closely tied to the data science mindset since, at its core, data science is science applied to real-world problems. So, in the previous article, we covered research, fidelity, and application as the key aspects of scientific knowledge, and how the three of them are closely linked to a fourth one, the scope. But how do all these relate to the scientist and her work? Let's find out.
So, if you recall, the aforementioned factors can be visualized in the schematic we saw in the previous post.
But what lies in the middle of all this? What's at the heart of scientific knowledge? If you guessed the scientific method, you are right. After all, scientific knowledge doesn't grow on trees (with the exception of that apple tree under which Newton was resting, perhaps). The scientific method is at the core of it, since it binds together research, fidelity, and even application to some extent. When an engineer (or the scientist herself) explores a new theory and tests its validity, they make use of the scientific method. Without it, they could still argue for or against the theory, but it would be more of a philosophical treatise than anything else. Naturally, philosophy has value too, especially when it is a practical kind of philosophy, like that of the Stoics. However, in science we are more interested in things that can be formulated mathematically and tested rigorously through various data analytics tools, such as statistics. The scientific method also constitutes the mindset of the scientist, something very important across different disciplines.
Now, if we were to explore this further, going beyond the plane of all the aforementioned aspects of scientific knowledge, we'd find (at least) two more aspects closely related to all this: understanding and vision, both of which have to do primarily with the scientist. Understanding involves how deep we go into the ideas that scientific knowledge entails. It is not just rational, though, since it involves our intuition too. Understanding is like the roots of a tree, grounding scientific knowledge in something beyond the data and making the scientific theory we delve into something potentially imbued with enthusiasm. When you hear some scientists talk about their inventions, for example, you can almost feel that. No scientist gets passionate about math formulas per se, but when it comes to their understanding of the scientific knowledge they have worked on, they can get quite passionate indeed!
In the other direction we have vision, which has to do with what we imagine about the scientific knowledge, be it its applicability, its extensions, or even the questions it may raise. The latter may bring about additional scientific projects, evolving the knowledge (and understanding) further. That's why it makes sense to visualize vision as an upwards vector. Likewise, we talk about understanding going deep, which is why we'd visualize it as a downwards vector. Naturally, we'd expect these two to be correlated to some extent, since deeper understanding makes for loftier visions regarding the scientific knowledge we explore. Also, these two aspects highlight the evolutionary nature of scientific knowledge, rendering it something highly dynamic and adaptive.
Hopefully, this article has shed some light on this intriguing topic. It may be a bit abstract but scientific knowledge is like this, at least until it manifests as technology. Feel free to share your thoughts on this topic through this blog. Cheers!
Webinars have been a valuable educational resource for years now, but only recently has the potential of this technology been appreciated so much. This is largely due to the Covid-19 situation, which has made conventional conferences a no-no. Also, the low cost of webinars, coupled with the ecological advantage they have over their physical counterparts, makes them a great alternative.
At a time when video-based content is abundant, it's easy to find something to watch and potentially educate yourself with. However, if you want quality content and value your time more than the easy accessibility of the free stuff out there, it's worth exploring the webinar option. Besides, the technology is nowadays more affordable than ever before, making it a high-ROI endeavor. As a bonus, you get to ask the presenter questions and do a bit of networking too.
How does all this fit with data science, though, and why is it part of this blog? Well, although webinars are good in general, they are particularly useful in data science, as the latter is a hot topic. Because it's such a popular subject, data science has attracted all sorts of opportunists who brand themselves as data scientists just to make a quick buck. These people tend to create all sorts of content that is low-veracity information at best (and a scam at worst). Since discerning legitimate content from mere clickbait can sometimes be difficult (these con artists have become pretty good at what they do), it makes sense to pursue reputable sources for this video content. One such source is the Technics Publications platform, which has recently started providing its own video content in the form of webinars. Although most of these webinars are on data modeling, a couple of them are on data science topics (ahem). Feel free to check them out!
Disclaimer: I have a direct monetary benefit in promoting these data science webinars. However, I do so after ensuring I put a lot of work into preparing them, the same amount of work I'd put into preparing for a physical conference, like Customer Identity World or Data Modeling Zone. The only difference is the medium through which this content is delivered.