I was never into Clustering. My Ph.D. was in Classification, and later on, I explored Regression on my own. I delved into unsupervised learning too, mostly dimensionality reduction, for which I've written extensively (even published papers on it). For some reason, Clustering seemed like a solved problem, and as one of my supervisors in my Ph.D. was a Clustering expert (he had even written books on this subject) I figured that there isn't much for me to offer there. Then I started mentoring data science students and dug deeper into this topic. At one point, I reached out to some data scientists I'd befriended over the years asking them this same question. The best responses I got were that DBSCAN is mostly deterministic (though not exactly deterministic if you look under the hood) and that K-means (along with its powerful variant, K-means++) was lightweight and scalable. So, I decided to look into this matter anew and see if I could clean up some of the dust it has accumulated with my BROOM.
Please note that when I started looking into this topic, I had no intention to show off my new framework nor to diminish anyone's work on this sub-field of data science. I have great respect for the people who have worked on Clustering algorithms, be it in research or their application-based work.
With all that out of the way, let's delve into it. First of all, deterministic Clustering is possible even if many data scientists will have you believe otherwise. One could argue that any data science algorithm can be done deterministically though this wouldn't be an efficient approach. That's why stochastic algorithms are in use, particularly in challenging problems like Clustering. There is nothing wrong with that. It's just frustrating when you get a different result every time you run the algorithm and have to set a random seed to ensure that it doesn't change the next time you use that code notebook where it lives. So, deterministic is an option, just not a popular one.
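To make the random-seed point concrete, here is a minimal sketch in Python of a toy 1-D K-means whose only stochastic step is the choice of initial centroids. The function name kmeans_1d and the tiny dataset are made up for illustration; this is not any library's implementation and has nothing to do with BROOM.

```python
import random

def kmeans_1d(values, k, seed, iterations=20):
    """Tiny 1-D K-means with random initial centroids (illustration only)."""
    rng = random.Random(seed)          # local generator: the seed fully controls the run
    centroids = rng.sample(values, k)  # the stochastic step: random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]  # hypothetical toy data
run_a = kmeans_1d(data, k=3, seed=42)
run_b = kmeans_1d(data, k=3, seed=42)
print(run_a == run_b)  # True: same seed, same centroids
```

With a different seed the initial centroids change, so the final result may change too, which is exactly the nuisance described above.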
What about being lightweight? Well, if it's an algorithm that requires running a particular process again and again until it converges (like K-means), maybe it's lightweight, but probably not so much since it's time-consuming. Also, most algorithms worth their salt aren't as simple as K-means, which though super-efficient, leaves a lot to be desired. Let's not forget the assumptions it makes about the clusters and its reliance on distance, which tends to fail when several dimensions are present. So, in a multi-dimensional data space, K-means isn't a good option, and just like any other clustering algorithm, it struggles. DBSCAN struggles too, but for a different reason (density calculations aren't easy, and in multi-dimensional space, they are a real drag).
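The remark about distance failing in many dimensions can be illustrated with a small experiment. The sketch below, which assumes uniformly random points in a unit hypercube (a convenient assumption, not a claim about real datasets), measures how much farther the farthest point is from a query than the nearest one. The helper name distance_contrast is hypothetical.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = distance_contrast(2)     # plane
high_dim_contrast = distance_contrast(500)  # many dimensions
print(low_dim_contrast > high_dim_contrast)
```

As the number of dimensions grows, the nearest and farthest points end up at nearly the same distance from the query, so a distance-based notion of "closest centroid" carries less and less information.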
So, where does that leave us? Well, this is quite a beast that we have to deal with (the combination of a deterministic process and it being lightweight), so we'll need a bigger boat! We'll need an enormous boat, one armed with the latest weapons we can muster. Since we don't have the computational power for that, we'll have to make do with what we have, something that none of the other brilliant Clustering experts had at their disposal: BROOM. This framework can handle data in ways previously thought impossible (or at least unfeasible). High dimensionality? Check. Advanced heuristics for similarity? Check. An algorithm that features higher complexity without being computationally complex? Check. But the key thing BROOM yields that many Clustering experts would kill for is the initial centroids. Granted that they are way more than we need, it's better than nothing and better than the guesswork K-means relies on due to its nature.
In the toy dataset visualized above, I applied the optimal clustering method I've developed based on BROOM. There were two distinct groups in the dataset across the approximately 600 data points located on a Euclidean plane. Interestingly, their centers were almost the same, so K-means wouldn't have a chance to solve this problem, no matter how many pluses you put after its name. The initial centroids provided by BROOM were in the ballpark of 75, which is way too high. After the first phase of the algorithm, they were reduced to 7 (!) though even that number was too high for that dataset.
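To see why groups with nearly identical centers are a problem for centroid-based methods, consider the following sketch. It assumes two concentric rings of 300 points each, which is just one way two groups can share a center and may not match the actual toy dataset; the radii and noise level are made up.

```python
import math
import random

def ring(n_points, radius, noise, rng):
    """Points scattered around a circle of the given radius centered at the origin."""
    points = []
    for _ in range(n_points):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, noise)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

rng = random.Random(0)
inner = ring(300, radius=1.0, noise=0.05, rng=rng)  # hypothetical group 1
outer = ring(300, radius=5.0, noise=0.05, rng=rng)  # hypothetical group 2

gap = math.dist(centroid(inner), centroid(outer))
print(gap)  # small compared to the outer radius, even though the groups never overlap
```

Since K-means assigns each point to its nearest centroid, two groups whose centroids practically coincide cannot be told apart that way, regardless of how the centroids are initialized.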
After some refinement, which took place in the second phase of the algorithm, they were reduced to 2. The whole process took less than 0.4 seconds on my 5-year-old laptop. The outputs of that Clustering algorithm included the labels, the centroids, the indexes of the data points of each cluster, the number of data points in each cluster, and the number of clusters, all as separate variables. Naturally, every time the algorithm was run it yielded the same results since it's deterministic.
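For readers wondering how those outputs relate to each other, here is a small Python sketch that derives the rest from the points and the labels alone. The function summarize_clusters and the five sample points are hypothetical; the actual algorithm is not shown here, only the bookkeeping around its results.

```python
from collections import defaultdict

def summarize_clusters(points, labels):
    """Derive centroids, member indexes, cluster sizes, and cluster count from labels."""
    indexes = defaultdict(list)
    for i, label in enumerate(labels):
        indexes[label].append(i)
    n_dims = len(points[0])
    centroids = {
        label: tuple(sum(points[i][d] for i in idx) / len(idx) for d in range(n_dims))
        for label, idx in indexes.items()
    }
    sizes = {label: len(idx) for label, idx in indexes.items()}
    return centroids, dict(indexes), sizes, len(indexes)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0), (11.0, 13.0)]
labels = [0, 0, 1, 1, 1]
centroids, indexes, sizes, n_clusters = summarize_clusters(points, labels)
print(centroids, indexes, sizes, n_clusters)
```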
Before we can generalize the conclusions that we can draw from this case study, we need to do further experimentation. Nevertheless, this is a step in the right direction and a very promising start. Hopefully, others will join me in this research and help bring Clustering the limelight it deserves, as a powerful data exploration methodology. Cheers!
I've been trying to answer this question for years. Well, not many years, but still, at least since the second half of the previous decade. Why? Well, I've always liked to explore the boundaries between the continuous and the discrete and since I finally internalized the teaching that everything in this universe is discrete (see: Quantum Physics), I decided to explore that angle and see if there was indeed a way to turn a continuous variable into a discrete one, with minimal information loss.
Over the past few months, I've developed three distinct approaches, depending on how distinct the values of the target variable are (see what I did there?). Let's start with something simple: no target variable at all. So, how can we discretize a continuous variable x? Well, you have to binarize it until there is no more binarization possible! But how do you optimally binarize a variable? That's something that involves densities after you handle all outliers and inliers in x, of course. How do you do that? Well, that's a topic that can fill a whole book chapter, so I'll have to draw the line here, I'm afraid.
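Since the density-based method isn't described here, the following Python sketch uses a much cruder stand-in to show the "binarize until you can't anymore" idea: split at the widest gap between consecutive sorted values and recurse on each side until no gap is at least min_gap wide. The function names, the min_gap parameter, and the data are all assumptions for illustration, and outlier handling is left out entirely.

```python
def widest_gap_threshold(values):
    """Midpoint of the widest gap between consecutive sorted values, plus the gap size."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        return None, 0.0
    gap, left = max((b - a, a) for a, b in zip(ordered, ordered[1:]))
    return left + gap / 2.0, gap

def discretize(values, min_gap):
    """Recursively split at the widest gap until no gap reaches min_gap."""
    threshold, gap = widest_gap_threshold(values)
    if threshold is None or gap < min_gap:
        return []  # no further binarization possible
    lower = [v for v in values if v < threshold]
    upper = [v for v in values if v > threshold]
    return discretize(lower, min_gap) + [threshold] + discretize(upper, min_gap)

x = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2, 5.0, 5.1]  # hypothetical feature values
thresholds = discretize(x, min_gap=1.0)
print(thresholds)  # two thresholds, roughly 1.15 and 3.6
```

Each threshold is one binarization, and the sorted list of thresholds defines the bins of the discretized variable.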
What about when there is a target variable? Let's start with a binary one as it's simpler this way. We can employ a robust similarity metric that can assess the similarity of two binary variables, regardless of their alignment or any similarities due to chance. Fortunately, I've developed one such metric, which I call holistic symmetric similarity (HSS), which also works with all sorts of discrete variables. So, by using this metric, we can optimize the split to maximize the HSS score between the binarized x and the target variable y. The same approach works if y is discrete but not binary since I've generalized HSS to handle nominal variables too.
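The threshold search itself is easy to sketch, even without HSS. In the Python example below, the agreement function is a deliberately simple stand-in (the share of matching values, made symmetric to label flipping); unlike HSS as described above, it does not correct for similarities due to chance. All names and data are hypothetical.

```python
def agreement(a, b):
    """Stand-in similarity for two binary sequences; symmetric to label flipping."""
    match = sum(1 for u, v in zip(a, b) if u == v) / len(a)
    return max(match, 1.0 - match)

def best_binary_split(x, y, score=agreement):
    """Try the midpoints between consecutive distinct x values; keep the best-scoring one."""
    ordered = sorted(set(x))
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    best_threshold, best_score = None, -1.0
    for t in candidates:
        binarized = [1 if v > t else 0 for v in x]
        s = score(binarized, y)
        if s > best_score:
            best_threshold, best_score = t, s
    return best_threshold, best_score

x = [0.5, 1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 7.5]  # hypothetical feature
y = [0, 0, 0, 0, 1, 1, 1, 1]                  # hypothetical binary target
threshold, best_score = best_binary_split(x, y)
print(threshold, best_score)  # 4.0 1.0
```

Any other similarity metric for discrete variables can be passed in through the score argument without changing the search.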
Ok, but what about when y is continuous, though? Well, that takes a bit more creativity since it's not as simple a task as it may seem. Fortunately, it's doable and relatively light, computationally speaking. We can find the threshold that maximizes a custom correlation metric that becomes larger once any non-linearities are tackled. This process doesn't have to be rocket science since I'm sure you can come up with a metric like that if you've been mentored by someone worth his salt in data science. Of course, you could use a translinearity correlation metric, yet, I wouldn't recommend that since it would inevitably pick up signals you wouldn't want it to, plus it's bound to be more computationally heavy.
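As a rough sketch of the continuous-target case, the Python snippet below picks the threshold that maximizes the absolute Pearson correlation between the binarized x and y. Plain Pearson correlation is swapped in here for simplicity; it is not the custom metric mentioned above and does nothing special about non-linearities. Function names and data are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def best_threshold(x, y):
    """Midpoint between consecutive distinct x values that maximizes |correlation| with y."""
    ordered = sorted(set(x))
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    return max(candidates,
               key=lambda t: abs(pearson([1.0 if v > t else 0.0 for v in x], y)))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.1, 0.0, 0.2, 0.1, 3.0, 3.1, 2.9, 3.2]  # hypothetical target that jumps once x passes 4
threshold = best_threshold(x, y)
print(threshold)  # 4.5
```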
So, there you have it. You can binarize and therefore discretize any feature x you like, with or without a target variable. The latter can be binary, discrete, or even continuous, depending on the problem at hand. Such a process can help you preserve computational resources and perhaps even enable you to make better and more transparent models (after all, binary variables tend to be easier on the mind, not just on the computer). All this I've done in the OD.jl script, which I cannot share here, unfortunately, as it has dependencies on proprietary code (the BROOM framework), which I'd rather not give away. Still, if you wish to explore this topic further, we can do that in a one-on-one mentoring session or two, given that you have the required commitment to the craft and a genuine interest to learn more about it. Cheers!
Many people talk about strategy nowadays, from the strategy of a marketing campaign to business strategy, and even content strategy. However, strategy is a more general concept that finds application in many other areas, including data science. In this article, we'll look at how strategy relates to data science work, as well as data science learning.
Strategy is being able to analyze a situation, create a plan of action around it, and follow that plan. Strategy is relevant when there are other people (players) involved, as it deals with the dynamics of the interactions among all these people. It's a vast field, often associated with Game Theory, a discipline pioneered by John von Neumann and famously advanced by John Nash, considered to be one of the best modern Mathematicians (he even won the Nobel prize for this work, once its applications in Economics were discovered). In any case, strategy is not something to be taken lightly, even if there are more lighthearted applications of it out there, such as strategy games, something about which I'm passionate.
Strategy applies to data science too, however, as the latter is a complex matter that also involves lots of people (e.g., the project stakeholders). Thinking about data science strategically is all about understanding the risks involved, the various options available, and employing foresight in your every action as a data scientist. It's not just a responsible role (esp. when dealing with sensitive data) but also a role crucial in many organizations. After all, in many cases, it's us who deliver insights that effect changes in the organization or bring about valuable (and often profitable) products or services, which the organization can market to its clients.
Strategy in data science is all about thinking outside the box and understanding the bigger picture. It's not just the datasets at hand that matter, but how they are leveraged and used to build valuable data products. It's about mining them for insights significant to the stakeholders instead of coming up with findings of limited importance. Data science is practical and hands-on, just like the strategies that revolve around it.
Strategy in data science is also relevant to how we learn it. We may go for the more established option of doing a course on it and reading a textbook or two that the instructor recommends. However, this is just one strategy and perhaps not the best one for you. Mentoring is another strategy that's becoming increasingly important these days since it's more hands-on and personal, in the sense that it addresses the specific issues that you, as a learner, run into while assimilating newfound data science knowledge. Another powerful strategy is videos and quizzes that provide you with valuable knowledge and know-how, which enable you to get a more intuitive understanding of a data science topic. Of course, there is also the strategy of combining two or more such strategies for a more holistic approach to data science learning.
Choosing a strategy for your data science work or your data science learning isn't easy. This matter is something you often need to think about and evaluate over several days. In any case, data science educational material can usually help you with that and can also supplement your work, enriching your skill-set. Some such material you can find among the books I've published as well as the video courses I've created (e.g., those on Cybersecurity). I hope they can help you in your data science journey and make it easier and more enjoyable. Cheers!
What Rust is
I may have mentioned Rust in the past, but now I’d like to talk more about it and its role in data science and A.I., as it has passed the test of time, in my view. After having delved into Rust programming a bit, enough to understand that it's much more challenging than I realized at first, I believe I can now write about it with confidence. Also, since it's not so new to me, I'm way past the infatuation stage that characterizes most people who have talked or written about it, usually shortly after they started exploring it.
So, Rust is a high-performance language, currently in version 1.51, and with a large enough community of users (and companies) to make a dent in the programming realm. There is even a Rust track in the Exercism platform, where there are dedicated mentors who can help you learn it through the carefully designed and curated programming drills on the Exercism website. What's more, there are a few interesting books on Rust, while there are also conferences and workshops for anyone serious about this language.
Rust’s key strengths
Rust isn't popular because of its particular name or its cool logo, though. Rust earned its popularity through the strengths it brings to the table and the value-adds that accompany its deployment. First of all, it's high-performance, meaning that you can use it instead of C, C++, or even Java. That's not an easy-to-accomplish thing, and few languages have accomplished that. Also, it offers this performance while maintaining a relatively high-level approach to programming, much like most modern languages that come about.
Additionally, Rust is reliable and as safe as it gets. Many consider it to be better in that respect than even C, which has a series of memory management issues resulting in risky code. So, if you want to build a program that just works and won't make you sleep with your phone on at night (in case you'll need to fix an issue of a script you've shipped), Rust is a good option.
Finally, Rust is geared towards productivity. It's not an academic language or something a bunch of hobbyists put together, far from that. Rust is built for devs and people who are dead serious about designing and deploying software. The language's well-written documentation adds to this. At the same time, its error messages, although frustrating at first, give you some actual insight as to what's wrong with your scripts (instead of some generic error message that's more of a puzzle than any real help for debugging your code).
Rust and Data Science
When it comes to data science work, particularly machine learning and AI-related tasks, Rust has the potential of being a great asset. I say this, even though I'm vested in another high-performance language, Julia, for which I've written extensively (my books on Julia) and continue to use up to this day. However, unlike those fanboys of this or the other data science language, I'm open to new possibilities, which I'm always eager to explore. So, even though I'm a long way from being a Rust veteran, I can see its merit in our field.
So far, there are a few Rust packages for ML work, such as Smartcore and Linfa (plant juice in Italian), though, in all fairness, this codebase is nowhere near the variety and maturity of the likes of Scikit-learn in Python and the packages in the Julia ecosystem. Still, there is a lot of value Rust offers in this space, and as the community grows, we should be expecting to see the ML and A.I. libraries of Rust grow both in number and sophistication.
It may seem a bit too early to tell, but it's not far-fetched to say that Rust is here to stay and make it. While high-level languages like Python had nothing more to offer than simplicity and ease-of-use (probably the main reason they made it to the data science world), Rust is closer to modern languages like Julia and Nim, which offer a serious performance boost. Its business proposition is unquestionable, its adoption higher than many people expected, and its potential of making a dent in machine learning is hard to contest. Once you get past its eccentric programming style, you may begin to view it with the respect and fondness it deserves. So, check it out when you have a moment. Cheers!
Machine Learning is the field involved in using various algorithms that enable a machine (typically a computer) to learn from the data available, without making any assumptions about it. It includes multiple models, some simpler, others more advanced, that go beyond the statistical analysis of the data. Most of these models are black-boxes, though a few exhibit some interpretability. Yet, despite how well-defined this field is, several misconceptions about it conceal it in a veil of mystique.
First of all, machine learning is not the same as artificial intelligence (A.I.). There is an overlap, no doubt, but they are distinct fields. You can spend your whole life working in machine learning without ever using A.I. and vice versa. The overlap between the two takes the form of deep learning, the use of sophisticated artificial neural networks that are leveraged for machine learning tasks. Computer Vision is an area of application related to the overlap between machine learning and A.I.
What’s more, machine learning is not an extension of Statistics. Contrary to what many Stats fans say, machine learning is an entirely different field distinct from Statistics. There are similarities, of course, but they have fundamental differences. One of the key ones is that machine learning is data-driven, i.e., it doesn't use any mathematical model to describe the data at hand, while Statistics does just that. It's hard to imagine Statistical models without a data distribution or some function describing the mapping, while machine learning models can be heuristics-based instead.
Nevertheless, machine learning is not purely heuristics-based and, therefore, void of theoretical foundations. Even if it doesn't have the 200-year-old amalgamation of the Statistics theory, machine learning has some theoretical standing based on the few decades of research on its back. Many of its methods rely on heuristics that "just work," but it's not what people consider alchemy. Machine learning is a respectable scientific field with lots to offer both to the practitioner and the researcher.
Beyond the misconceptions mentioned earlier, there are additional ones that are worth considering. For example, machine learning is not plug-and-play, as some people think, no matter how intuitive the corresponding libraries are. What's more, machine learning is not always the best option for the problem at hand, since some projects are okay with something simple that's easy to understand and interpret. In cases like that, a statistical model would do just fine.
It's hard to do this topic justice in a single blog post, but hopefully, this has given you an idea of what machine learning is and what it isn't. I talk more about this subject in one of my most recent books, Julia for Machine Learning. Additionally, I plan to cover this topic in some depth in a 90-minute talk at the next Data Modeling Zone conference in Belgium this April. I hope to see you there! Cheers.
Like any project in an organization, a data science project needs to have boundaries regarding how many resources are allocated to it. This resource usage translates into a monetary cost that takes the form of a budget. So, even if many professionals in this field are not aware of it, budgeting plays an essential role in every data science project and provides the framework through which it can manifest.
Despite their similarities to other projects, data science projects differ in many ways. First of all, their return isn't clear or even guaranteed. A data science team may investigate a dataset for insights or predictive potential, but it may not dig up something worthwhile. The data in a company's databases may be useful for its day-to-day tasks but useless for anything data science-related. It's next to impossible to know if there is anything worthwhile beforehand, so when starting a project, you have to take significant risks. As for the project's time frame, that's also highly uncertain, especially if it's a new project. This uncertainty can veer the project off-course, and the risk of going over-budget is substantial.
When creating a budget for a data science project, several factors are considered to mitigate the risk of failure. First of all, you need to have a clear plan of what you expect the data science team to find. If possible, you could have some ideas as to how you could translate these findings into a value-add, be it through a revenue stream, some improvement in the customer/user experience, or some enhancement in the organization's workflow efficiency.
What’s more, it's good to examine a data science project from various perspectives and ensure that the data scientists involved have peace of mind when working on it. It's not just up to them to make it work, since the other stakeholders have a responsibility in it too. For example, the data owners need to do their part and ensure that the data science team receives all the data it needs promptly. The developers involved need to have sufficient bandwidth to help with any ETL and other project processes. Finally, the business people need to have realistic expectations of what the data science team can deliver and how to leverage their work.
Beyond the above factors that you need to consider, some additional considerations are useful to have when creating a budget. For instance, the cost of cloud computing involved is something that can get out of hand quickly, especially if you are not used to working at a particular scale of data. Sometimes it's more effective to have dedicated servers available to you instead of leasing computing power in a virtual machine. Also, it would make sense to start with a proof-of-concept project to gain a better understanding of the problem at hand before going at it with full force.
You can learn more about this and the other less technical aspects of data science work through one of the books I co-authored relatively recently. Namely, the Data Scientist Bedside Manner book delves into this topic and explores data science from a perspective few people consider. Using information from various sources, including some experienced professionals in the field, it provides guidance both for the data-driven manager and the data science professional, bridging the gap between the two. Check it out when you have the chance. Cheers!
Non-Negative Matrix Factorization (NNMF or NMF) is a powerful method used in Recommender Systems, Topic Modeling in NLP, Image analysis, and various other areas. It involves breaking a matrix into a product of two other matrices, having either positive or zero values in them. The idea is to preserve meaning in the components derived since, in some cases, it doesn't make sense to have negative values in them (e.g., in Topic Modeling, you can't have a document with a negative membership to any given topic). NNMF is not an exact science, since it's an NP-hard problem. As a result, we try to find an approximate solution using various tricks. Although some people view NNMF as Stats-based, it is a machine learning technique based on Linear Algebra and Optimization.
The math of NNMF is relatively straightforward, though not something you would do with pen and paper, or even a calculator! There are various approaches to NNMF, all of which involve finding two matrices W and H such that the norm of the original matrix X minus W*H is minimal, while W and H remain non-negative:

$$\min_{W \ge 0,\ H \ge 0} \lVert X - WH \rVert$$

The norm part is the objective function of this minimization problem, by the way. Two common ways to accomplish this are the multiplicative update (gradually approximating W and H, using a particular rule), and the Hierarchical Alternating Least Squares (HALS) method, which attempts to find the columns of W one by one.
A common trick for NNMF is to use Singular Value Decomposition (SVD) first to find a rough approximation for W and H and then refine it gradually. Alternatively, we can use regularization to ensure that W and H's elements remain relatively small, resulting in a more stable solution. However, keep in mind that the solutions NNMF yields are approximate and correspond to local minima of the objective function.
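To give a feel for the multiplicative update mentioned earlier, here is a minimal pure-Python sketch that uses the classic Lee and Seung rule for the Frobenius norm. It assumes a random non-negative initialization (not the SVD trick), no regularization, and a tiny made-up matrix; for real work, a library implementation is the way to go. The helper names (matmul, transpose, frobenius_error, nmf) are my own.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def frobenius_error(x, w, h):
    """Frobenius norm of X - W*H, i.e. the objective function."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0]))) ** 0.5

def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """NMF via multiplicative updates, starting from random positive W and H."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise; eps avoids division by zero
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

X = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 6.0]]  # toy matrix
W0, H0 = nmf(X, rank=2, iterations=0)  # the random starting point
W, H = nmf(X, rank=2)                  # same seed, after 200 updates
initial_error = frobenius_error(X, W0, H0)
final_error = frobenius_error(X, W, H)
print(final_error < initial_error)  # True
```

Because every update only multiplies non-negative numbers, W and H can never turn negative, which is what makes this rule attractive despite its slow convergence. A different seed may lead to a different local minimum.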
Fortunately, there are programming libraries that do all the heavy-lifting for us when it comes to NNMF. One such library is the NMF.jl one, in Julia. If you are more of a Python user, you can use the NMF class from the decomposition module of the sklearn package. Both of these libraries are well-documented so getting the hang of them is relatively straightforward.
You can learn more about data science methods like this one in my book, Data Science Mindset, Methodologies, and Misconceptions. In this book, I talk about all kinds of processes and techniques used in data science so that even the non-technical reader can grasp the intuition behind them and gain an understanding and appreciation of them. This book provides several external resources to go more in-depth on these topics and organize how you continue learning about this field, without getting lost in it. So, check it out when you have some time. Cheers!
Data science knowledge is vast and varied. It entails an in-depth understanding of data, the impact of models on this data, and various ways to refine the data making it more useful for these models. Also, it has to do with ways to depict this data graphically and make useful predictions based on it, using new data as inputs. Specialized data science knowledge also involves depicting this data in different ways (e.g. via a graph structure), gathering it from various sources (e.g. text), and creating interactive applications based on the data models built. Naturally, all this can be of value not just to data scientists but also to other data-related professionals. Let's examine how.
So, how can data science knowledge help data analysts and business intelligence professionals? After all, they are the closest to the role of a data scientist and deal with data in similar ways. These professionals can benefit from data science knowledge through a more in-depth understanding of the data, particularly when it comes to ETL processes and data wrangling. Also, for those more geared towards data models, they can learn more advanced models such as the machine learning models data scientists use and start using them in their work.
As for data modelers (data architects), data science knowledge can help those professionals too. After all, designing a useful information flow or implementing such a design into a database is closely linked to how this information is used. So, by understanding the potential different variables have (something that's bread and butter for a data scientist), a data modeler can optimize his work and build systems that are more future-tolerant. That is particularly useful in cases where the domain is dynamic, like in the e-commerce field.
Programmers can benefit a lot from data science knowledge too, especially those versed in OOP and functional languages (e.g. Julia and Scala). After all, data science involves a great deal of programming so there is a good overlap in the skill set of the two types of professionals. For this reason, many programmers end up getting into data science once they familiarize themselves with data science models, something they can do easily once they get exposed to data science knowledge.
Finally, data-driven managers have a lot to gain from data science knowledge, perhaps more than any other professional. The reason is that if you are involved in data-driven projects, you need to know what’s possible with the data you have and what kind of products or services you can build using this data. This is something you can do even without getting your hands dirty, by thinking in terms of data science. So, having some data science knowledge (particularly knowledge related to how data science is applicable and what data products look like) can go a long way. As a bonus, recruiting data scientists to implement your ideas is much easier if you are familiar with data science, something your recruits are bound to appreciate.
If you found this article interesting, you can learn more about data science and how it is leveraged in an organization through a book I co-authored last year. Namely, the Data Scientist Bedside Manner book covers this and similar topics thoroughly, along with some other practical knowledge on this subject. So, check it out when you have the chance and spread the word about it to friends and colleagues in the aforementioned lines of work. Cheers!
The holiday season is upon us, something that translates for many of us to more free time. That’s why I decided to keep this article light and perhaps fun. After all, we all deserve a break after a year like this one! Regardless of your plans for the holidays, there are certain things you can do that are both enjoyable and educational.
For starters, if you are interested in programming (particularly recreational programming), you can check out the Exercism.io site. Exercism is an educational non-profit that aims to help people pick up a new programming language, including some of the more esoteric ones like bash. The site comprises a series of exercises, some of which are on the language track while others are self-paced. As you proceed with the track, you unlock new exercises and explore new concepts in your language of choice. Also, there is some mentoring aid if you choose that option, helping you when you get stuck and/or showing you better ways to solve the exercises through useful tips and hints.
Another thing you can do is watch some videos on data science, or even take a course on the subject of data science or A.I. I know it may seem like a lot, but there is a lot of good material out there if you know where to look, which can help you augment your skills and know-how. Also, the cost of all this is fairly low, compared to what it used to be, so this sort of material is more accessible than ever before.
If you are up for more hands-on activities, you can play around with some data and do a mini data science project. Pick a dataset you are interested in and see what insights you can dig out from it. The project doesn't have to be 100% covered in explanatory text, but even without it, it can be good practice for you. Bonus points if you use a new technique or method.
Furthermore, you can check out some data science articles to be more up to speed on the latest trends or view certain topics from a different perspective. This blog can be a good place to start. Of course, if you want to read something that covers the subject in more depth, you can check out my data science books. You can find all of them at the publisher’s website, along with other technical books on similar subjects (esp. data modeling). Also, if you were to apply the coupon code DSML at the checkout, you can receive a 20% discount on whatever book you buy from that site.
So, there you have it. With these suggestions, you can now make good use of your time, without stressing. Besides, when you learn something this way, it tends to stick longer. Who knows, some of these activities may bear good fruits that you can leverage in the new year. Happy holidays!
The usefulness of JSON lies in the fact that it's versatile and relatively concise. What's more, it's faster than other similar file formats, while it's already widely used for web-related applications, making it easy to find mature programming libraries for it. Moreover, JSON is very intuitive, and many text editors have built-in functionality for viewing such files in an easy-to-read way. Furthermore, it's easy to create and edit JSON files yourself using a text editor, while programmatically, it's a walk in the park.
JSON’s compatibility with NoSQL databases is one of its fortes. Such systems include databases like MongoDB, which are quite popular in data science. Most new databases are also compatible with JSON as it's become a kind of standard. Additionally, JSON and the dictionary data structure go hand-in-hand, something vital in data science work. So, if you want to load some data from a JSON file, you can store it in a dictionary, while if you have a dataset (any dataset), you can code it as a dictionary (each variable being a key) and store it as a JSON file.
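Here is what that round trip looks like in Python, using only the standard json module. The tiny dataset and the file name dataset.json are made up for illustration; each variable is a key, as described above.

```python
import json
import os
import tempfile

# A small hypothetical dataset coded as a dictionary: one key per variable.
dataset = {
    "age": [34, 51, 27],
    "income": [48000.0, 72000.5, 39000.0],
    "churned": [False, True, False],
}

# Store it as a JSON file (in a temporary folder, to keep things tidy).
path = os.path.join(tempfile.mkdtemp(), "dataset.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)

# Load it back into a dictionary.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == dataset)  # True
```

Keep in mind that JSON keys are always strings, so a dictionary with, say, integer keys won't come back exactly as it went in.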
The JSON.jl library in Julia is one worth knowing about, especially if you want to use this programming language in your data science work. This fairly simple package enables you to parse and create JSON files, using the primitive Dict structure. It's a convenient library to know, even if it's still in version 0.21.x. JSON.jl makes use of the FileIO package on the back-end and its most useful functions are parse(), parsefile(), and print(). Note that the latter works with different data structures, not just dictionaries.
The JSON file format is closely linked to APIs too. The latter are particularly useful in various data-related applications and are instrumental in certain data products developed by data scientists. Also, many APIs are essential for acquiring data, so knowing about them goes without saying. APIs are ideal for proof-of-concept projects, too, as they don't require too much work to get one up-and-running. As a result, they are a versatile tool for all sorts of projects, particularly those with a web presence.
The API Success book describes this technology in sufficient depth, without getting too technical. Besides, if you understand APIs' usefulness and how they fit into the bigger picture, it's not too hard to learn the technical aspects too, through a tutorial, for example. Note that you can get a 20% discount on this and any other book available at the publisher's website using the coupon code DSML. Using this code will also help me out, so you can see it as a way to support this blog. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.