As privacy gains importance these days, so does transparency in data science work. This is to be expected for another reason too, one I hope has become obvious if you have been following this blog: transparent models are easier to explain to others. There are further advantages as well (for instance, transparent models are easier to tweak and optimize), which I'm not going to elaborate on right now. Instead, I'm going to look at the various data models used in data science and where they fall on the transparency spectrum.
At one extreme of this spectrum lie the most transparent data models. These are usually statistics-based and can provide the exact contribution of each feature, so you know exactly what's going on in the decisions behind their predictions. Even if you know nothing about data science, you can still make sense of these models and the predictions they yield. Their main disadvantage is that they are not as accurate, partly because of the overly simple processes they rely on.
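To make the transparency argument concrete, here is a quick sketch in Julia (the toy numbers are mine): in a linear model fitted with ordinary least squares, each coefficient states a feature's exact contribution to the prediction.

```julia
# Toy illustration: a linear model's coefficients are its own explanation.
X = [1.0 2.0; 1.0 3.0; 1.0 5.0; 1.0 7.0]  # first column models the intercept
y = [3.1, 5.0, 9.2, 13.1]
beta = X \ y  # ordinary least squares solution
# beta[2] is the exact change in the prediction per unit of the feature,
# a number anyone can read off, with no data science background required.
```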
At the other extreme of the spectrum, you can find the most opaque data models. These are usually AI-based and are often referred to as black boxes. Not only do they tell us nothing about feature importance, but trying to explain their inner workings is a futile task. However, they tend to have an edge in accuracy, plus they require very little prep work on the data they use (data engineering).
Somewhere in the middle of the spectrum lie all the other models, mostly under the machine learning category. These include random forests and boosted trees (some transparency), k-nearest neighbors (very little transparency), support vector machines (no transparency), and fuzzy logic systems (fairly decent transparency). That's the category of models most people forget, since they tend to think of transparency as a binary attribute.
Finally, it's good to remember that transparency is usually linked to a business requirement. Sometimes the performance you obtain from black-box models is a worthwhile trade-off, since some projects require highly accurate predictions. So, transparency is not always a necessity, even if it can facilitate the communication of these models to the project stakeholders. As a result, it's always good to ask whether you need the extra transparency a statistical model may offer if you can achieve better performance with a less transparent one.
For more information about transparency and other aspects of data science models (particularly machine learning related ones), you can check out my latest book, Julia for Machine Learning. It's a very hands-on book that also provides plenty of the information needed to build the right mindset for data science work. It includes lots of examples, too, along with links to useful resources that can help you understand all the concepts involved.
Data analytics is the field of analyzing data and using any insights you discover to facilitate an organization's workflow. It has a very hands-on approach to things and focuses on describing a problem accurately and liaising with the stakeholders to drive decisions based on the data analyzed. Data science is akin to that, though it employs the scientific method to go deeper into the data and develop more sophisticated strategies to drive those decisions.
Nowadays, the roles of the data analyst and the data scientist are somewhat mixed up, since the latter is still relatively new. The fact that its evangelists haven't publicized it accurately makes it even more challenging to understand how it fills a somewhat different niche. The data analyst role is more widespread and ties to a large variety of tasks, including marketing analytics (e.g., SEO) and business intelligence (BI) work. An organization usually leverages a data scientist in cases involving more unstructured data (e.g., text) or data in various forms, making data wrangling a necessary part of the analysis. Also, in scenarios where the objectives are not as clear-cut (e.g., predicting the next quarter's sales), a data scientist is usually preferred. However, it's worth pointing out that many data scientists start their careers as data analysts and that both roles are necessary. After all, they both work with data to produce insights, even if they often go about it differently.
Beyond the differences mentioned previously, another critical difference between the two roles lies in the models used. Namely, the data analyst is more geared towards describing the data and understanding the problem it represents. The data scientist digs deeper into the data and goes a step further, also making predictions (through mathematical models, usually based on machine learning). As a result, she can put together a complete solution, such as a predictive model accessible through an API. Naturally, she can also create a dashboard, something that is among the data analyst's deliverables.
What's more, although data analysts can tackle all sorts of data, they usually draw the line at text and semi-structured data. Datasets like these require specialized methods, such as Natural Language Processing (NLP), which fall in the data scientist's domain. The use of AI is often essential in such problems, something a data scientist is usually required to know but which lies beyond the job description of a data analyst.
You can learn more about data science and the data scientist's role through a couple of my books. Specifically, Data Scientist: The Definitive Guide to Becoming a Data Scientist covers the ins and outs of this role and offers some practical advice on how you can pursue a career in this field. Additionally, the Data Science Mindset, Methodologies, and Misconceptions book showcases the field overall and its defining aspects, as well as the essential techniques used. Together, the two books offer a bird's-eye view of data science and how you can build your career in it. Check them out when you have a chance. Cheers!
Statistics is a very interesting field with some relevance to data science work. Perhaps not as much as some people claim, but it's definitely a useful toolset, particularly in the data exploration part of the pipeline. But how can Stats improve, particularly with the help of new technologies like A.I.?
Statistics can benefit from such technologies in various ways. The most important is the mindset of the AI-based approach, which is empirical and pragmatic. Instead of imagining complex theories to explain the data, AI looks at the data as it is and works with it accordingly (the data-driven approach).
Conventional Statistics, on the other hand, tries to fit the data into one mathematical model or another to describe it, and then processes the data with some arbitrary metrics that are mediocre at best. However, the math is elegant, so we go with it anyway. So many textbooks can't be wrong, right? Well, in data science there is no right or wrong, just models that work well and others that don't work as well. Since there is usually money on the line, we prefer the former, which coincidentally tend to be machine learning related, particularly AI-based. So, there is surely room for improvement for Statistics if it were to adopt the same mindset.
What's more, Statistics can benefit from A.I. through additional heuristics (or statistics, as they are often referred to in that context). The existing heuristics may work well, but they are very narrowly defined, making them overly specialized. AI-based heuristics are broader and tend to be applicable across a greater variety of datasets. If Statistics were to adopt a similar approach to heuristics, it would surely benefit greatly and become more widely applicable.
Finally, Statistics can benefit from A.I. by embracing a different approach to describing the data. This is a more fundamental change, and most fans of the field will probably dismiss it as impractical. However, it is feasible and even efficient, with a bit of clever programming (based on heuristics and a geometrical approach). The latter is something Statistics seems divorced from, which is another area for improvement.
It's worth noting that although developments in Statistics are bound to be beneficial to anyone applying it in data analytics projects, the practitioner also needs to evolve. There is no point in advancing this field if its practitioners remain in their old ways, limited and rigid. Perhaps that's one of the reasons machine learning and A.I. have advanced so much; their practitioners are more open to changes and willing to adapt. No wonder these fields now dominate the data science world. Something to think about...
PS - This article was supposed to be published yesterday. However, there was an issue with the scheduler, hence the delay. Normally, I publish new material every Monday and sometimes on Thursdays too. Cheers!
With all this talk these days about Statistics and other frameworks and their immense value in data science, it’s good to be more pragmatic about this matter. After all, it’s not a coincidence that Machine Learning maintains the top position both as a framework and as a specialization when it comes to data science work. In this article, we'll explore why this is.
First of all, machine learning is a more scientific paradigm for data science. It makes hardly any assumptions, relying on the data at hand and little else. Well, there are also the ML models it makes use of, but it doesn't try to cast everything as one distribution or another and rely on metrics based on those distributions. The scientific approach has proven very useful in understanding the world, so it only makes sense to use it (in the form of machine learning methods) in data science too.
What's more, machine learning makes use of more advanced methods than other frameworks. After all, it makes sense that when a framework works well, as machine learning does, more methods get researched and refined. As a result, the models machine learning brings to the table are closer to the state of the art and more efficient. This makes using the machine learning framework a no-brainer, particularly for critical processes where accuracy and efficiency are key requirements.
Also, machine learning nowadays is powered to a great extent by AI, creating powerful models that outperform anything else available to a data scientist. This may be a trend that's here to stay, since many AI-based models have proven exceptionally good and versatile. Although these models have special requirements that may not be met in every data science setting, it's good that this option is available for data science work.
Moreover, machine learning is easier to learn and use since it doesn't have a lot of theory behind it. As a result, you don't need to spend a lot of time learning it or worrying about the requirements of each model, as in Statistics. Of course, there is some theory in this framework too, but it's fairly straightforward and doesn't require overly specialized math to learn to an adequate degree.
Finally, there are lots of libraries nowadays for every machine learning model or process, making them easy to implement. In other words, you don't have to do a lot of coding to get your machine learning method up and running. Also, the fact that these libraries usually have adequate documentation makes it easier to understand the corresponding programs and techniques too, supplementing your learning.
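As a small illustration of how little code a library-backed model typically needs, here is a minimal sketch in Julia, assuming the DecisionTree.jl package is installed (the toy data is mine):

```julia
using DecisionTree  # a popular Julia package for tree-based models

features = rand(100, 4)               # 100 observations, 4 numeric features
labels = rand(["yes", "no"], 100)     # a toy binary target
model = build_tree(labels, features)  # train a decision tree
predictions = apply_tree(model, features)  # predict on the same data
```

A couple of lines for training and prediction, with the library handling all the underlying details.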
Speaking of learning, if you wish to learn more about machine learning through a hands-on approach to the subject, feel free to check out my latest book, Julia for Machine Learning (Technics Publications). There I cover the subject in some depth, explaining how you can use Julia to deploy different kinds of machine learning models and heuristics. Cheers!
Throughout our careers in data science and AI, we constantly encounter all sorts of obstacles that hinder our development. This is something inevitable, particularly when we undertake a role that's constantly evolving. However, the biggest obstacle is not something external, as one might think, but something closer to home. On the bright side, this means that it’s more within our control than anything subject to external circumstances. Let’s clarify.
The biggest obstacle relates to the limits of our aptitude, something primarily linked to our knowledge and know-how. After all, no one knows all there is to know on a subject as broad as data science (or AI). However, as we gather enough knowledge to do what we are asked to, we can be overtaken by the idea that we know enough. Eventually, this can morph into a conviction and even expand, leading us to cultivate the illusion that we know everything there is to know in our field. Naturally, nothing could be further from the truth, since even a unicorn data scientist has gaps in her knowledge.
One great way to avoid this obstacle is to constantly challenge ourselves in anything related to our field. I'm not talking about Kaggle competitions and other trivial things like that; after all, these are hardly realistic as data science challenges. I'm referring to tackling techniques and methods you are lacking, as well as refining those you already have under your belt. This may seem simple, but it's not, especially since no one enjoys becoming aware of the things he doesn't know or doesn't know fully. Perhaps that's why developing ourselves isn't something easy or popular.
Another way to enhance ourselves is by reading technical books related to our field. Of course, not all such books are worth your while, but if you know where to look, finding good ones is not as challenging a task. What's more, it's good to remember that the value of such a book also depends on how you process the new information. For example, many such books include exercises and problems the reader is asked to solve. By taking advantage of such opportunities, you can learn the new material better and develop a deeper understanding of the topics presented.
One way to learn more is through Technics Publications books. Although many of the books from that publishing house relate to data modeling, there are a few data science-related ones, as well as a couple on AI. Of course, even the data modeling books can be useful to a data scientist, since we often need to deal with databases, particularly in the initial stages of a project. Also, if you buy a book from this publisher using the coupon code DSML, you get a 20% discount. The same applies to any webinars you may register for. So, if the cost of this material is an obstacle for you, at least with this code you can alleviate it and get a bigger bang for your buck!
Although it's fairly easy to compare two continuous variables and assess their similarity, it's not so straightforward when you perform the same task on categorical variables. Of course, things are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think.
For example, if two variables are aligned (zeros to zeros and ones to ones), that's fine: you can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are inversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity finds them dissimilar, even though such a pair of variables is clearly relevant and the first one could be used to predict the second. Enter the Symmetric Jaccard Similarity (SJS), a metric that alleviates this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of two Jaccard similarities: one with the features as they originally are and one with one of them reversed.
SJS is easy to use and scalable, and its implementation in Julia is quite straightforward. You just need to be comfortable with contingency tables, something that's already an easy task in this language, though you can also code it from scratch without too much of a challenge. Anyway, SJS is a fairly simple metric, and something I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that's not as simple as it may first seem.
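For the binary case, here is a minimal from-scratch sketch in Julia (the function names and toy data are mine, not a packaged implementation):

```julia
# Jaccard similarity of two binary vectors: |x AND y| / |x OR y|
function jaccard(x::AbstractVector{Bool}, y::AbstractVector{Bool})
    denom = sum(x .| y)
    return denom == 0 ? 1.0 : sum(x .& y) / denom  # two all-zero vectors: identical
end

# Symmetric Jaccard Similarity: the maximum of the Jaccard similarity
# of the pair as-is and with the first variable reversed
sjs(x, y) = max(jaccard(x, y), jaccard(.!x, y))

x = [true, true, false, false]
y = [false, false, true, true]  # perfectly inversely similar to x
jaccard(x, y)  # 0.0 - plain Jaccard misses the relationship
sjs(x, y)      # 1.0 - SJS captures it
```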
Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it shuffles the values of the first variable until its similarity with the second variable is maximized, something done in a deterministic and scalable manner. However, the algorithm makes it apparent that SJS may fail to reveal the edge a non-symmetric approach could yield, namely when certain values of the first variable are more similar to a particular value of the second variable. In practical terms, this means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes.
That's why an exhaustive search of all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is also a good one.
Of course, SJS for nominal features may be useful for assessing whether one of them is redundant. Just as we apply a correlation metric to a group of continuous features, we can apply SJS to a group of nominal features, eliminating the unnecessary ones before we start breaking them down into binary features, something that can make the dataset explode in size in some cases.
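As a rough sketch of this idea, building on the sjs function above, one could break each nominal feature into binary dummies and use the best pairwise SJS as a crude redundancy score (this is my own simplification, not the deterministic nominal SJS algorithm described earlier):

```julia
# One-hot encode a nominal vector into a Bool matrix (one column per level)
onehot(v) = [x == level for x in v, level in unique(v)]

# Two nominal features encoding the same grouping under different labels
a = ["red", "red", "blue", "green", "blue"]
b = ["r",   "r",   "b",    "g",     "b"]

A, B = onehot(a), onehot(b)
# Maximum SJS over all dummy pairs; 1.0 here suggests one feature is redundant
score = maximum(sjs(A[:, i], B[:, j]) for i in 1:size(A, 2), j in 1:size(B, 2))
```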
All this is something I was working on the other day as part of another project. In my latest book, Julia for Machine Learning (Technics Publications), I talk about such metrics (though not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!
The concept of antifragility was established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g., in Investopedia). It's a vast concept, and it's unlikely I can do it justice, especially in a blog post. That's why I suggest you familiarize yourself with it before reading the rest of this article.
Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones, which manage to extract every drop of signal from the data they are given), there are fragilities all over the place in how these models are used. A clear example of this is the computer code around them. I'm not referring to the code used to implement the models, which usually comes from specialized packages; that code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it for ETL work, feature engineering, or even data visualization, may not always be good enough.
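To make this concrete, here is a small, hypothetical illustration in Julia of the defensive style such surrounding code often lacks: validating inputs and handling degenerate cases up front, instead of letting the pipeline crash in production (the function and scenario are my own example):

```julia
using Statistics  # for mean and std

# A fragile version would just compute (x .- mean(x)) ./ std(x), crashing on
# empty input and silently producing NaNs on constant or single-value input.
function safe_zscore(x::AbstractVector{<:Real})
    length(x) < 2 && throw(ArgumentError("need at least two values to standardize"))
    s = std(x)
    s == 0 && return zeros(length(x))  # constant feature: avoid division by zero
    return (x .- mean(x)) ./ s
end
```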
Antifragility applies to computer code in various ways. Here are the ones I’ve found so far:
All this may seem like a lot of work, and it may not agree with your time constraints, particularly if you have strict deadlines. However, you can always improve your code after you've cleared a milestone. This way, you can avoid some Black Swans, like an error being thrown while the program you've made is already in production. Cheers!
Last week I had to perform a major operation on my computer. Namely, I had to replace the hard drive, as it was failing (regular warnings from the computer's SMART diagnostics reminded me of the fact). The fact that there wasn't a single computer shop around that was a) open for business and b) willing to undertake such a task didn't help things either. So, after waiting about a month for a new hard disk to arrive by post, I took out my toolbox and started the operation of replacing the SSD my computer had. Naturally, I had backed up all my data beforehand and prepared a USB disk with an OS image installed, to use once the new hard disk was up and running.
I won't go into detail about the unbelievable challenges this process entailed (from stuck screws to a failing USB disk, to archive files that were apparently corrupt and couldn't restore their content to the new hard disk). Instead, I'd like to focus on the gist of this whole experience, something far more relatable than the specifics of my situation. In essence, this was a "close to the metal" kind of experience, one that was both grounding and educational in a hands-on sense. Planning things is fairly easy, but executing the plan and improvising alternative routes due to unforeseen (and possibly unforeseeable) circumstances is something we can all learn from. For example, at one point I had to find a different way to get the system running (an alternative USB disk), do a video call with a friend of mine (thanks, Matt!) to troubleshoot the issue, and even come up with a contingency for backing up data in the future, so that it's less prone to issues.
How does all this relate to data science? Well, in data science / AI projects we often have to deal with challenging situations that require us to get out of our comfort zone. We may even need to go into "closer to the metal" territory, e.g., the OS shell, for ETL tasks and such. Also, we may have to re-examine the architecture of the model used (e.g., the number of nodes in each layer, in the case of an ANN), the data used for training the model (do we really need all of the variables / data points?), and other factors we often don't think about.
Being closer to the metal is not something that concerns only programmers or computer technicians. It's a state of mind that can come in very handy, even in high-level professions such as ours. Just as a good leader in a company has good relations with every echelon of the organization, even people he doesn't interact with on a regular basis, a good data scientist ought to do the same. Detachment is useful in problem-solving, but let's not make it our default way of being. Sometimes we need to roll up our sleeves and handle tools we don't usually use (e.g., the aforementioned screwdriver). With the right attitude, this can be a growth experience. Cheers!
If you don't know what the word hyperthesis means, don't worry; it's a term I came up with myself. Stemming from the Greek "υπέρθεση," which means "hyperposition" or "superposition" depending on how you translate it, it describes transcendence of the binary state, but in a dynamic context (not to be confused with quantum superposition, which is somewhat different). In other words, it has to do with the controlled oscillation between extreme states until an equilibrium state is attained, at least at a reasonable robustness level predefined in the specs of the project at hand.
The Hyperthesis Principle, therefore, describes the behavior of a complex system characterized by hyperthetical behavior. Namely, if a system's state oscillates between two extreme states until it reaches an equilibrium of sorts, it exhibits hyperthetical behavior. If this behavior is a function of the parameters of the data the system relies on, then the system can, in theory, attain a stable evolutionary course resulting in equilibrium, namely a robust state.
“What does this have to do with data science, doc?” I can hear you say. Well, if you have been reading my blog, you may recall that predictive data models, especially the more sophisticated ones, are in essence complex systems. As such, they may be anywhere on the high bias – high variance spectrum. Now, we may tweak the parameters like a drunkard, hoping we get them right, or we can do so through an understanding of the data and the model at hand. One way to accomplish the latter is through grid search, though this may not always be easy or computationally affordable. Imagine an SVM, for example, trained over a large dataset. It may take a while to find the optimal parameters for that model through a grid search, which is why we often resort to more stochastic approaches. This is where AI creeps in, even if we don't call it that. After all, whenever a sophisticated optimization method is applied, the system exhibits a form of rudimentary intelligence. The more advanced the optimizer, the more it fits the bill, and calling it AI comes effortlessly.
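For reference, here is a minimal grid search sketch in Julia; the parameter names (C and gamma) follow SVM conventions, and evaluate is a hypothetical stand-in for whatever train-and-validate routine you'd use (e.g., one returning cross-validated accuracy):

```julia
# Exhaustively scan a logarithmic grid of two hyper-parameters,
# keeping the pair with the best validation score.
function grid_search(evaluate)
    best_score, best_params = -Inf, (C = NaN, gamma = NaN)
    for C in 10.0 .^ (-2:3), gamma in 10.0 .^ (-4:1)
        score = evaluate(C, gamma)
        if score > best_score
            best_score, best_params = score, (C = C, gamma = gamma)
        end
    end
    return best_params, best_score
end

# Toy stand-in for a real train-and-validate routine
toy_evaluate(C, gamma) = -((log10(C) - 1)^2 + (log10(gamma) + 2)^2)
grid_search(toy_evaluate)  # best at C = 10.0, gamma = 0.01
```

With 6 × 6 = 36 combinations this is cheap, but the cost grows multiplicatively with every added parameter, which is exactly why stochastic (and AI-based) optimizers become attractive.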
Anyway, if we apply intelligence, artificial or otherwise, to a problem like that, we are in essence applying the hyperthesis principle. How well we do so depends on how well we understand the problem we are trying to solve. However, being aware of this principle and applying it consciously can greatly facilitate the whole process. After all, all this is done iteratively, oftentimes involving several rounds of training and testing. Setting up the corresponding experiments can be aligned with the aforementioned principle, optimizing the whole process. So, instead of tweaking the model haphazardly, we make changes that make sense, navigating it towards a point in the parameter space that optimizes performance and robustness.
Understanding all this is the most important step in truly grasping AI and letting that understanding enhance our thinking. Also, it lies at the core of the data science mindset. Cheers!
Hi everyone. Since I've been exploring a different avenue for data science education these days, I've put together another webinar that's just 3 weeks away (May 18th). If you are interested in AI, be it as a data science professional or as a stakeholder in data science projects, this is something that can add value to you. Also, you'll have a chance to ask me questions directly and, if time allows, even have a short discussion on this topic.
Note that due to the success of previous webinars on the Technics Publications platform, the price of each webinar has risen. However, this upcoming webinar, which was originally designed as a talk for an international conference in Germany, is still at the very accessible price of $14.99. Feel free to check it out here and spread the word to friends or colleagues. You can also learn about the other webinars this platform offers through the corresponding web page. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.