For some reason, people who delve into data science tend to focus more on certain aspects of the craft at the expense of others. One aspect that rarely gets the attention it deserves is the concept of distance. Ask a data scientist (especially one who is fairly new to the craft or overspecialized in one part of it), and they'll tell you about the distance metrics they are familiar with and how distance is a kind of similarity metric. Although all of this is true, it portrays only part of the picture.
I've delved into the topic for several years now, and since my Ph.D. is based on transductive systems (i.e. data science systems that are based on distances), I've come to have a particular perspective on the matter, one that helps me see the incompleteness of it all. After all, no matter how many distance heuristics we develop, the way distance is perceived will remain limited until we look at it from a more holistic angle. So, let's look at the different kinds of distances out there and how they are useful in data science.
Distances of the first kind are those most commonly used, expressed through the various distance heuristics people have devised over the centuries. The most common ones are Euclidean distance and Manhattan distance. Mathematically, each is defined as a norm of the vector connecting two points: the L2 norm yields the Euclidean distance, while the L1 norm yields the Manhattan distance.
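To make this kind of distance concrete, here is a minimal sketch in Julia (the two points are just made-up examples):

```julia
# Two hypothetical points in a 3-dimensional feature space
a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.5]

# Euclidean distance: the L2 norm of the difference vector
euclidean(x, y) = sqrt(sum((x .- y).^2))

# Manhattan distance: the L1 norm of the difference vector
manhattan(x, y) = sum(abs.(x .- y))

euclidean(a, b)  # ≈ 3.64
manhattan(a, b)  # 5.5
```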
Another kind of distance is the normalized one. Every distance metric that doesn't fall into this category is crude and limited to the particular set of dimensions it was calculated in. This makes comparisons of distances between two datasets of different dimensionality impossible (if the meaning is to be maintained), even if mathematically it's straightforward. Normalizing the matrix of distances among the various data points in a dataset requires finding the largest distance, something feasible when the number of data points is small but quite challenging otherwise. What if we need the normalized distances of just a sample of data points, because the whole dataset is too large? That's a fundamental question that needs to be answered efficiently (i.e. at a fairly low big-O complexity) if normalized distances are to be practical.
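As a rough Julia sketch of the naive approach (assuming the dataset fits in memory, with one data point per row):

```julia
# Naive normalization of a distance matrix: divide every distance by the
# largest one, so all values end up in [0, 1]. Feasible for small datasets;
# for very large ones, the maximum would have to be estimated (e.g. from a
# sample), which is exactly the open question raised above.
function normalized_distances(X::AbstractMatrix)
    n = size(X, 1)  # one data point per row
    D = [sqrt(sum((X[i, :] .- X[j, :]).^2)) for i in 1:n, j in 1:n]
    return D ./ maximum(D)
end
```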
The last and most interesting kind of distance is the weighted distance. Although this kind of distance is already well documented, the way it has been tackled is fairly rudimentary, considering the plethora of possibilities it offers. For example, by warping the feature space based on the discernibility scores of the various features, you can improve the feature set's predictive potential in various transductive systems. Also, using a specialized weighted distance, you can better pinpoint the signal of a dataset and refine the similarity between two data points in a high-dimensional space, effectively rendering the curse of dimensionality a non-issue. However, all this is possible only through a different kind of data analytics paradigm, one that is not limited by the unnecessary assumptions of the current one.
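Here is a minimal Julia sketch of the basic idea, with the weight vector assumed to come from some feature-scoring process (discernibility scores being one possibility):

```julia
# Weighted Euclidean distance: features with larger weights stretch the
# space along their axes, so differences in them count for more.
weighted_euclidean(x, y, w) = sqrt(sum(w .* (x .- y).^2))

# Hypothetical weights, e.g. discernibility scores of three features
w = [0.7, 0.2, 0.1]
weighted_euclidean([1.0, 2.0, 3.0], [4.0, 0.0, 3.5], w)  # ≈ 2.67
```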
Naturally, you can combine the latter two kinds of distances for an even more robust distance measure. Whatever the case, understanding the limitations of the first kind of distances is crucial for gaining a deeper understanding of the concept and applying it more effectively.
Note that all this is my personal take on the matter. You are advised to approach this whole matter with skepticism and arrive at your own conclusions. After all, the intention of this post is to make you think more (and hopefully more deeply) about this topic, instead of spoon-feeding you canned answers. So, experiment with distances instead of limiting your thinking to what's already been documented. Otherwise, the distance between what you can do and what you are capable of doing, in data science, will remain depressingly large...
Lately, there has been an explosion of interest in Data Science, mainly due to the appealing job prospects for someone who has the relevant know-how. It is easy, unfortunately, to slip into a state of complacency whereby data science becomes all too familiar and you find yourself applying the same methods and the same processes to whatever problems you are asked to solve. This situation can be quite toxic, even if it's unlikely anyone will tell you so. After all, as long as you deliver what you have to deliver, no one cares, right? Unfortunately, no. If you stop evolving as a data scientist, chances are that you'll become obsolete in a few years, while your approach to the problems at hand will cease to be as effective. Besides, the field evolves, as do the challenges we data scientists have to face.
The remedy to all this is exploring data science with a renewed sense of enthusiasm, something akin to what is referred to as "beginner's mind" in the Zen tradition. Of course, enthusiasm doesn't come about on its own after you've experienced it once. You need to create the conditions for it, and what better way to do that than to explore data science further. This exploration can be in breadth (i.e. additional aspects of the craft, including but not limited to new methods) or in depth (i.e. understanding the inner workings of various algorithms and the variants they may have). Research in the field can go a long way when it comes to both of these exploration strategies. It's important to note that you don't need to publish a paper in order to do proper research. In fact, you can do perfectly adequate research with just a computer and a few datasets, as long as you know how.
It's also good to keep breadth and depth in balance when you are exploring data science. Going too broad can leave you with a superficial knowledge of the field, while going too deep can make you overspecialized. Which you do first, however, is totally up to you. Also, it's important to use reliable resources when exploring the field, since nowadays it seems that everyone wants to be a data science content creator, without having the essential training or educational mindset. A good rule of thumb is to stick to content that has undergone extensive editing, such as the material made available through a publisher, particularly one specializing in data-related books and videos.
Whatever the case, it's always good to explore data science in an enjoyable manner too. Find a dataset you are interested in before starting to apply some obscure method. This way the whole process will become more manageable and perhaps even fulfilling. Fortunately, there is no shortage of datasets out there, so you have many options. Happy exploring!
The other day I did something I'd been putting off for a while, since if it didn't work out, it would mean that I'd have to throw away my computer, so to speak. I didn't exactly meddle with any of the computer's hardware, but I came as close to it as I could without physically changing the machine. Namely, I tweaked the boot software and configured a new OS that I'm now using. "What's wrong with the old OS?" you ask. Well, I'd tweaked it way too much in the past, so it had become quite unstable. Yet, even in this pitiful state, it was better than some other OSes I've had over the years, so it's hard to complain about it.
Whatever the case, getting down to the nitty-gritty of a computer isn't easy, and there is a surprising lack of people out there able or willing to help out. Also, the forums, although generally useful, don't always cover the exact issue you are looking to solve, so you basically need to rely on your own skills. Fortunately, I did a thorough back-up of all my data beforehand, so nothing could get lost. Also, I was quite meticulous with the whole process and had a back-up plan in place. A lot of shell scripting was involved, and although I'm not super confident about this type of interaction with a computer, it's not as daunting as it seems either. Of course, if you do it more, like professionals in the field, it may even seem the best way to interface with a computer. I'm not there yet, but I have a deeper appreciation of the merits of this approach to interfacing than I did before.
This whole thing is akin to the engineering approach to things, where failure is always taken into account, since things break more often than people think. Thinking that everything is going to be fine just because it worked fine in someone's presentation or tutorial is naive and doesn't exactly spell professionalism. That's why having the right mindset about all this stuff is essential. Algorithms, equations, and coding libraries can only get you so far. After that, you are on your own, and you need not just a solid understanding of the theory but also the ability to deal with the adverse circumstances that will probably present themselves sooner rather than later.
Now, in your work as a data scientist or an A.I. professional you'll probably have no need to do low-level work on a computer (unless you are setting up a new pipeline), but if such a challenge presents itself, you are better off facing it. And who knows, maybe you'll do more than just upgrade your computer through this whole process, since chances are that you'll also be upgrading yourself.
So, what did I learn from this whole experience? First of all, I now have a deeper appreciation for all those people who do the low-level work in a data science pipeline. It may appear straightforward from a high-level perspective, but when you get down to it, it isn't simple at all, even if you enjoy working on a CLI. Also, I learned that just because something isn't common enough to be on a forum or a blog article, it doesn't mean it's not important or worth doing. The OS upgrade I did helped me realize how vast the spectrum of possibilities is when it comes to OSes and how deviating from the most popular approaches is probably the best way to go (or at least the most fox-like way!). Finally, I learned that when you've assembled something yourself, even if it's a fairly straightforward OS, you appreciate it more. Most things nowadays come preassembled and we don't have to do anything to get them to work, but the things that require our own energy to come to life, be it an OS or a custom data science model, are the things we tend to remember the most, since they change us inside...
So, after attending a truly eye-opening conference in Amsterdam last month, I felt obliged to share at least some of what I got from it (the parts most relevant to data science) with other people, through a reliable content-sharing platform. So, I wrote an article about this topic on beBee and then created a video, which is now available on Safari.
Note that this is a fairly high-level video, with emphasis on managerial and senior-level data science practices rather than the hands-on aspects of the craft. However, every data scientist can benefit from this knowledge, especially when dealing with sensitive data. Also, Safari content requires a subscription in order to be accessible in its full length.
Well, like most things of a certain level of sophistication, the answer is "it depends." But before we delve into this matter, let's start by defining what DS research is exactly. By this term, I refer to the advancement of the field through experimentation around new ideas, methods, and techniques, and even the development and testing of new algorithms applicable to data science. That sounds like a lot, but in practice there is a great deal of specialization, so it's not as overwhelming as it seems. For example, someone may do research in data science technology, focusing on distributed computing, while someone else focuses on the design of a new supervised learning technique or a heuristic.
But don't you need funding for all this? Well, in the traditional approach to research, funding, usually in the form of grants sponsored by a government or some large organization, is essential. After all, scientific research requires a great deal of resources and people who, although passionate about the subject, may not work for free. Nevertheless, the expenses of research in data science are minimal, meaning that you don't need a huge grant in order to get the ball rolling.
In essence, when you do DS research, your key expenses are your time and cloud computing rentals. After all, Amazon and Microsoft need to make some money too when you use their cloud services for your projects. Still, prototyping is something you can do on your own computer, so the cloud bill doesn't have to be very high, unless you are working with a particularly large dataset, one that qualifies as big data.
I'm not saying that everyone can do data science research on their own. However, nowadays it's easier than ever before to experiment without a lot of facilities or some sponsorship for a research project. People have been publishing papers on their own for years now, and unless you want to do some large-scale research project, you can work with limited resources. And who knows, maybe that idea of yours can morph into a business product or service that becomes a data science start-up. It doesn't hurt to try!
A good tool for doing data science research is Julia, particularly through the Jupyter notebook environment. You can learn more about the language through the corresponding website, while my book on it can be a great resource for delving deeper into it. Note that the book was written for an earlier version of the language, so the code may not be compatible with the latest version (v. 1.0) of Julia. Cheers!
Trinary Logic is not something new. It's been around for decades, though it was more of a mathematical, high-level framework. I should know, as I did my Master's thesis on this subject and how it applies to GIS. I even wrote code implementing the corresponding model I came up with, though in today's programming world it seems like legacy code... Anyway, the bottom line is that Trinary Logic is useful and could have a place in modern Information Systems, including data analytics projects. The question is, could it be applicable to A.I. too?
The answer is, as usual, "it depends." Trinary Logic on its own is quite limited and, unless you are familiar with its 700+ gates, it may seem like any novel idea: interesting but not exactly something worth delving into. After all, just like any system of reasoning, Trinary Logic is meaningless without an in-depth understanding of its key contribution to the thorny issue we always tackle through reasoning: handling uncertainty effectively.
Uncertainty, oftentimes modeled as noise or randomness (depending on whom you ask), is everywhere. Since we cannot eliminate it without damaging the signal too, we find ways to cope with it. Trinary Logic offers an interesting way of doing that through the third value of its variables, namely the "indifferent" state. Something can be True, False, or Indifferent, the latter being something in-between. This is reminiscent of the intermediate values in the membership functions of fuzzy variables in Fuzzy Logic, a well-known and quite established A.I. framework with lots of applications in data science. Do you see where I'm going with this?
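To make the three-valued idea concrete, here is a minimal Julia sketch of the basic connectives, using the common Kleene-style convention (just an illustration; it barely scratches the surface of the framework's gates):

```julia
# A minimal three-valued logic sketch, using Kleene-style semantics (one
# common convention; the 700+ gates mentioned above are not enumerated here).
# Encoding: -1 = False, 0 = Indifferent, 1 = True.
t_not(a)    = -a
t_and(a, b) = min(a, b)
t_or(a, b)  = max(a, b)

t_and(1, 0)   # 0: True AND Indifferent is Indifferent
t_or(-1, 0)   # 0: False OR Indifferent is Indifferent
t_not(0)      # 0: NOT Indifferent remains Indifferent
```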
So, Trinary Logic is a framework for reasoning, much like Fuzzy Logic, but the latter is an A.I. framework too, so Trinary Logic is A.I. as well, right? Well, no. Trinary Logic is a mathematical construct, so unless it is applied to A.I. programmatically, and as a well-defined process, it is yet another concept that can't even fetch an academic publication! But if it were to manifest as a heuristic of sorts and add value to a process in the A.I. sphere, things would be different.
Enter the Trinary Curve, a heuristic (or meta-heuristic, depending on how you use it) that encapsulates Trinary Logic in a simple yet not simplistic way, turning an input signal into something that an A.I. agent can understand and work with. Namely, it can engineer a new variable in the [-1, 1] interval (notice the closed brackets in this case), which makes the in-between state of uncertainty more evident to the corresponding module. As a result, the A.I. agent is allowed to be unsure about something and examine it more closely, given the right architecture, instead of working with what it has and hoping for the best. Note that the Trinary Curve can be customized, while its output can be normalized to a different interval (always a closed one) if needed. The Trinary Curve is differentiable throughout the space where it is defined, and it's easy to use programmatically (at least in Julia).
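To give a rough feel for the idea, here is a simplified stand-in in Julia; the actual Trinary Curve is customizable and not reproduced here, but any curve with the stated properties would look something like this:

```julia
# A simplified, illustrative stand-in for the Trinary Curve (not the actual
# heuristic): smooth and differentiable everywhere, with a plateau around
# zero (the "indifferent" zone) and saturation towards -1 (False) and
# 1 (True). The parameter k controls how sharp the transitions between the
# three regions are.
trinary_curve(x; k = 2.0) = tanh(k * x)^3

trinary_curve(0.0)   # 0.0 -> fully indifferent
trinary_curve(0.1)   # ≈ 0.008 -> still mostly indifferent
trinary_curve(2.0)   # ≈ 0.998 -> effectively True
```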
Perhaps the Trinary Curve is a novelty and an A.I. system can evolve adequately without it. However, it is something worth considering, instead of just experimenting with the countless parameters of existing A.I. systems. After all, Trinary Logic is compatible with existing A.I. frameworks, so if it's not utilized, it's primarily because of some people's unwillingness to think outside the box, and that's something that doesn't have any uncertainty about it...
This week I'm away, as I prepare for my talk at the Consumer Identity World EU 2018 conference in Amsterdam (the same conference takes place in a couple of other places, but I'll be attending just the one in Europe). So, if you are in the Dutch capital, feel free to check it out. More information on my talk here. Cheers!
Dichotomy: a binary separation of a set into two mutually exclusive subsets
Data Science: the interdisciplinary field for analyzing data, building models, and bringing about insights and/or data products that add value to an organization. Data science makes use of various frameworks and methodologies, including (but not limited to) Stats, ML, and A.I.
After getting these pesky definitions out of the way, in an effort to mitigate the chances of misunderstanding, let's get to the gist of this fairly controversial topic. For starters, all the information here is for educational purposes and shouldn't be taken as gospel, since in data science there is plenty of room for experimentation, and someone adept in it doesn't need to abide by this taxonomy or any rules deriving from it.
These inaccurate dichotomies in data science, however, can be quite problematic for newcomers to the field, as well as for managers involved in data-related processes. After all, learning about this field requires a considerable amount of time, something that is not within the temporal budget of most people involved in data science, particularly those who are starting off now. So, let's get some misconceptions out of the way so that your understanding of the field is not contaminated by the garbage that roams the web, especially social media, when it comes to data science.
Namely, there are (mis-)infographics out there stating that Stats and ML are mutually exclusive, or that there is no overlap between non-AI methods and ML. The latter implies that ML is entirely a part of AI, something that is considered blasphemy in the ML community. The reason is simple: ML as a field was developed independently of AI and has its own applications. AI can greatly facilitate ML through its various network-based models (among other systems), but ML stands on its own. After all, many ML models are not AI-related, even if AI can be used to improve them in various ways. So, there is an overlap between ML and AI, but there are also non-AI models under the ML umbrella.
The same goes for Statistics. This proud sub-field of Mathematics had been the main framework for data analytics for a long time before ML appeared, revolting against the model-based approach dictated by Stats. However, things aren't that clear-cut. Even if the majority of Stats methods are model-based, there are also methods that are hybrid, having elements of both Stats and ML. Take Bayesian Networks, for example, or some variants of the Naive Bayes model. Although these models are inherently statistical, they have enough elements of ML that they can be considered ML models too. In other words, they lie on the nexus of the two sets of methods.
What about Stats and AI? Well, Variational AutoEncoders (VAEs) are an AI-based model for dimensionality reduction and data generation, so there is no doubt that they lie within the AI set. However, if you look under the hood, you'll see that they make use of Stats to figure out what the data they generate should look like. Specifically, they make use of distributions, a fundamentally statistical concept, for understanding and generating the data involved. So, it wouldn't be far-fetched to put VAEs in the Stats set too.
From all this, I hope it becomes clear that the taxonomy of data science models isn't as rigid as it may seem. If there ever was a time when this rigid separation of models made sense, that time is now gone, as hybrid systems are becoming more and more popular, while at the same time the ML field expands in various directions outside AI. So, I'd recommend you take those (mis-)infographics with a pinch of salt. After all, most likely they were created by some overworked employee (perhaps an intern) with a limited understanding of data science.
Interestingly, the video throughput on Safari has increased lately, so we don't have to wait too long before a video gets approved and published. This little guy, for example, I just finished on Thursday, and it's already online on the Safari platform. It's by no means an exhaustive survey of the ML field, which is much larger than many people think and doesn't consist of A.I. methods only. This video is more of an overview of ML and how it relates to other aspects of Data Science, such as Statistics, A.I., and various applications. So, if you are new to Data Science or want a comprehensive overview of the topic to supplement your studies of the subject, feel free to check it out!
With the plethora of material out there for data science education, it is easy to get overwhelmed and even confused about what to study and how much time, money, and effort to put into it. Enter the evaluation of data science material, a concise strategy for tackling this issue. In this 24-minute video, I talk about the various aspects of data science material, criteria for evaluating it, the matter of resources required to delve into this material, and some useful things to keep in mind in your data science education efforts. Whether you are a newcomer to the field or a more seasoned data scientist, you have something to learn about data science (I know I do!), and this video can hopefully aid you in that. You can find it on Safari.
Note that in order to view this video in its entirety, you'll need a subscription to the Safari platform. Also, it's important to remember that this video can only offer you a framework for evaluating data science material; you'll still need to find that material and put in the effort to study it, in order to make the most of it. The video can merely help you organize your efforts more efficiently. Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.