For about a month now, I’ve been working on a new technical book for Technics Publications. This is a project that I've been thinking about for a while, which is why it took me so long to start. Just like my previous book, this one will be hands-on, and I'll be using Julia for all the code notebooks involved. Also, I'll be tackling a niche topic that hasn't been done before in this breadth and depth, in non-academic books. Because of this book, I won't be writing on this blog as regularly as before.
If you are interested in technical books from Technics Publications, as well as any other material made available from this place, you can use the DSML coupon code to get a 20% discount. This discount applies to most of the books there and the PebbleU subscriptions. So, check them out when you have a moment!
Recently a new educational video platform was launched on the web. Namely, Pebble U (short for Pebble University) made its debut as a way to provide high-quality knowledge and know-how on various data-related topics. The site is subscription-based, while it requires a registration for watching the videos and any other material available on it (aka pebbles). On the bright side, it doesn't have any vexing ads! Additionally, you can request a short trial of it, for some of the available material, before you subscribe to it. Win-win!
Pebble U has a unique selection of features that are very useful when consuming technical content. You can, for example, make notes and highlight parts and add bookmarks, on the books you read. As for the videos, many of them are accompanied by quizzes to embed your understanding of the topic covered. The whole platform is also available as an app for both Android and iOS devices.
The topics of Pebble U cover data science (particularly machine learning and A.I., though there are some Stats related videos too), Programming (particularly Python), and Business, among other categories. As the platform grows, it is expected to include additional topics and a larger number of content creators. All the videos are organized in meaningful groups called disciplines, making it easy to build on your knowledge. Of course, if you care for a particular discipline only, you can subscribe to material of that area only, saving you some money.
In the screenshot above, you can see some of my own material that are available on PebbleU right now. Many of them are from my Safari days, but there are also some newer ones, particularly on the topic of Cybersecurity.
By the way, if you find the subscription price a bit steep, remember that you can use the coupon code DSML I've mentioned in previous posts, to get a 20% discount. So, check it out when you have some time. This may be the beginning of something great!
I've talked about mentoring before and even mentioned it a few times in my books and videos. After all, it's an integral part of learning data science and A.I., among other fields. However, not all mentoring is created equal, and that's probably one of the most valuable lessons to learn in education. Unfortunately, to learn such a lesson you usually have to rely on your own experiences (since not many people want to talk about this matter).
Nowadays, everyone can sign up to particular sites and pretend to be a mentor. Sites like that often offer this for free since they know that charging for such a low-quality service would make them liable to lawsuits. However, the learner of data science and other fields often lacks the discernment to see such places for what they are in reality. Fortunately, however, there are much better alternatives.
Across the web, some sites provide proper mentoring, usually at a reasonable price, for all sorts of disciplines, including data science and A.I., among other fields. Many of these sites incorporate mentoring as part of their educational services, which include online classes too. However, that's not always the case. Someone can mentor you in your field of choice without having to follow a curriculum. This option is particularly appealing to professionals and people who have a relatively full schedule.
Proper mentoring involves various things, such as the following:
* career advice
* putting together a good resume or CV
* interview practice (particularly technical interviews)
* feedback on hands-on projects
Ideally, mentoring is a long-term process, though you can also opt for a handful of sessions to tackle specific problems you need help with. As long as you have an open mind, a willingness to learn, and value your mentor’s time, you are good to go. Naturally, bringing a specific task into the mentoring session can also be very useful, as it can help make it focused and productive.
By the way, if you wish to work with me as a mentee, I have some availability these months. What's more, I have set up a way to schedule these sessions efficiently using Calendly and have established collaboration with a freelance platform to handle payments and such (my Kwork link). So, if you are up for some proper mentoring, feel free to give me a buzz. Cheers!
Every year, there is a data modeling conference that takes place around the world. Its name is Data Modeling Zone, or DMZ for short (not to be confused with the DMZ in Korea, which isn't that good a place for data professionals!). Just like last year and the year before that, this year, I'll be participating in the conference as a speaker, talking about data science- and AI-related topics.
Namely, I'll talk about the common misconceptions about Machine Learning, something you may remember from my previous books. Still, this talk will cover the topic in more depth and help even newcomers to the field distinguish between the hype and the reality of machine learning. After my presentation, there will be some time for Q & A, so if you have any burning questions about this topic, you have a chance to have them answered.
Just like last year, DMZ is going to be online this year, making it super easy for you to attend, regardless of where you are. Also, there are plenty of interesting talks on various data-related topics, as you can see from the conference’s program.
I hope to see you there this November 18th!
I was never into Clustering. My Ph.D. was in Classification, and later on, I explored Regression on my own. I delved into unsupervised learning too, mostly dimensionality reduction, for which I've written extensively (even published papers on it). For some reason, Clustering seemed like a solved problem, and as one of my supervisors in my Ph.D. was a Clustering expert (he had even written books on this subject) I figured that there isn't much for me to offer there. Then I started mentoring data science students and dug deeper into this topic. At one point, I reached out to some data scientists I'd befriended over the years asking them this same question. The best responses I got were that DBSCAN is mostly deterministic (though not exactly deterministic if you look under the hood) and that K-means (along with its powerful variant, K-means++) was lightweight and scalable. So, I decided to look into this matter anew and see if I could clean up some of the dust it has accumulated with my BROOM.
Please note that when I started looking into this topic, I had no intention to show off my new framework nor to diminish anyone's work on this sub-fiend of data science. I have great respect for the people who have worked on Clustering algorithms, be it in research or their application-based work.
With all that out of the way, let's delve into it. First of all, deterministic Clustering is possible even if many data scientists will have you believe otherwise. One could argue that any data science algorithm can be done deterministically though this wouldn't be an efficient approach. That's why stochastic algorithms are in use, particularly in challenging problems like Clustering. There is nothing wrong with that. It's just frustrating when you get a different result every time you run the algorithm and have to set a random seed to ensure that it doesn't change the next time you use that code notebook where it lives. So, deterministic is an option, just not a popular one.
What about being lightweight? Well, if it's an algorithm that requires running a particular process again and again until it converges (like K-means), maybe it's lightweight, but probably not so much since it's time-consuming. Also, most algorithms worth their salt aren't as simple as K-means, which though super-efficient, leaves a lot to be desired. Let's not forget the assumptions it makes about the clusters and its reliance on distance, which tends to fail when several dimensions are present. So, in a multi-dimensional data space, K-means isn't a good option, and just like any other clustering algorithm, it struggles. DBSCAN struggles too, but for a different reason (density calculations aren't easy, and in multi-dimensional space, they are a real drag).
So, where does that leave us? Well, this is quite a beast that we have to deal with (the combination of a deterministic process and it being lightweight), so we'll need a bigger boat! We'll need an enormous boat, one armed with the latest weapons we can muster. Since we don't have the computational power for that, we'll have to make do with what we have, something that none of the other brilliant Clustering experts had at their disposal: BROOM. This framework can handle data in ways previously thought impossible (or at least unfeasible). High dimensionality? Check. Advanced heuristics for similarity? Check. An algorithm that features higher complexity without being computationally complex? Check. But the key thing BROOM yields that many Clustering experts would kill for is the initial centroids. Granted that they are way more than we need, it's better than nothing and better than the guesswork K-means relies on due to its nature.
In the toy dataset visualized above, I applied the optimal clustering method I've developed based on BROOM, there were two distinct groups in the dataset across the approximately 600 data points located on a Euclidean plane. Interestingly, their centers were almost the same, so K-means wouldn't have a chance to solve this problem, no matter how many pluses you put after its name. The initial centroids provided by BROOM were in the ballpark of 75, which is way too high. After the first phase of the algorithm, they were reduced to 7 (!) though even that number was too high for that dataset.
After some refinement, which took place in the second phase of the algorithm, they were reduced to 2. The whole process took less than 0.4 seconds on my 5-year-old laptop. The outputs of that Clustering algorithm included the labels, the centroids, the indexes of the data points of each cluster, the number of data points in each cluster, and the number of clusters, all as separate variables. Naturally, every time the algorithm was run it yielded the same results since it's deterministic.
Before we can generalize the conclusions that we can draw from this case study, we need to do further experimentation. Nevertheless, this is a step in the right direction and a very promising start. Hopefully, others will join me in this research and help bring Clustering the limelight it deserves, as a powerful data exploration methodology. Cheers!
OK, this title may sound a bit heavy, especially for this time of year. Let me break it down for you. There are various correlation metrics out there, which can handle two variables (let's call them x and y) and measure their relationship. More often than not, these metrics focus on the linear aspects of this relationship and are often confused by the non-linear ones. For example, a correlation metric like Pearson’s Correlation can tell you that a variable y defined as 2x + 5 is strongly correlated to x (a shocker, isn't it!) but if a variable z is defined as exp(x^2 + 1) were to be used instead, well Pearson's Correlation might struggle with that. A mathematician or even a Stats professional would assure you that there is a non-linear relationship between the two variables (x and z), but they'd have to rely on a plot of the two variables or some transformation of one of them (e.g., applying log() to z) if they were to measure this relationship. Things get even more complicated if the relationship is not as simple, e.g. that of x and a variable w defined as cos(x). Most likely, Pearson's correlation won't find anything there (a relationship close to 0), even though the Math or Stats professional mentioned previously would be sure there is a relationship there. So, what gives?
Well, what gives is a big question that if I were to answer it here, it would shake your belief in Stats like a super quake, similar to that which brought San Francisco down over a century ago. Interestingly, most Stats concepts are from around that same time, perhaps a bit older than that. So, you've got to give those guys a break since they didn't know any better, plus they didn't have the tools we have at our disposal. Given the circumstances, they did a pretty good job at defining the metrics they did and weaving the fabric of a theory around their methods. Come to think about it, if modern mathematicians were like them, we'd be reasoning in high-dimensional terms now, instead of relying on these old-fashioned formulas and techniques.
I propose a method based on the BROOM framework that looks into the non-linear and non-monotonous artifacts of a pair of variables to establish their relationship. This metric, which I call rbc (range-based correlation, as it's part of the ranges part of the framework), explores the two variables in an entirely data-driven manner, making no assumptions whatsoever about their distributions and their other aspects. As long as they are normalized, they are good to go. And this metric, contrary to all other correlation metrics I've tried, yields a correlation of 0.99 for the x-w pair and a similar figure for the x-z one. When you compare x with some random variable q (q belongs to the [0, 1] interval), it yields a weak correlation (usually between 0.1 and 0.2). As a result, we can deduce that it's a worthwhile metric for measuring the relationship between two variables, taking into account all non-linear artifacts while being unaffected by any lack of a monotonous pattern the two variables may exhibit. If you are interested in learning more, feel free to contact me. Cheers!
Everyone can analyze data these days, given the right programming tool and some library of functions, to express practically the relevant know-how of that person. I've seen people who give away books for free (as it would be impossible for them to get others to buy them) analyze data. As data science becomes more widespread, data analysis becomes a given for a larger portion of the population. But what about data synthesis, however? What's up with that? Let's delve into this.
First of all, let's get some definitions down. Data synthesis is the creation of synthetic data that follows a given pattern. The latter can be given directly to the data generation program, or it can be derived (extrapolated) via data analytics. Synthetic data is ideally indistinguishable from conventional data, and you can use it to train a data model, for example. However, there is something that makes it extremely valuable.
The value of synthetic data lies in the fact that it's not tied to particular individuals, so using this data doesn't pose any PII-related issues. Because of this, it cannot be owned by any specific person, even if it can be leveraged in the data science pipeline, yielding value. Naturally, since there are no shortcuts to value-making, the value (information) of that synthetic data must come from somewhere. So, since it's not practical for someone to have a high-level mathematical representation of this value and give it to a program as a pattern, it's more likely that this value stems from the source data.
So, to have valuable synthetic data (that's also free of PII), we need to have some source data of value, for starters. That's why the only practical way to generate synthetic data that's worth its space on a hard disk is via analytics. Of course, there are ways to generate such data through analytics, as in the case of some specialized deep learning networks (Autoencoders). The catch is that these A.I. systems require lots of data to do their job. After all, analyzing multiple variables isn't easy, even for an A.I. What if there was a way to perform the same task without employing these more advanced data-hungry systems?
Enter the BROOM framework again! We've already described some of its functionality, but what if this was just a prelude for its more sophisticated aspects? Well, fortunately, data synthesis isn't all that different from sampling, if you know what you are doing? And if you can sample a dataset properly, it's not that much more challenging to create new data points aligned with its essence. Naturally, the synthetic data is generated in a stochastic manner since it makes more sense to leverage noise in this process. Otherwise, all the generated data would be the same every time. Oh, and did I mention that this data synthesis process is scalable to as many dimensions as you like? Because if you understand data in-depth, the cardinality of vectors in a dataset is just another number...
I've been trying to answer this question for years. Well, not many years, but still, at least since the second half of the previous decade. Why? Well, I've always liked to explore the boundaries between the continuous and the discrete and since I finally internalized the teaching that everything in this universe is discrete (see: Quantum Physics), I decided to explore that angle and see if there was indeed a way to turn a continuous variable into a discrete one, with minimal information loss.
Over the past few months, I've developed three distinct approaches, depending on how distinct the values of the target variable are (see what I did there?). Let's start with something simple: no target variable at all. So, how can we discretize a continuous variable x? Well, you have to binarize it until there is no more binarization possible! But how do you optimally binarize a variable? That's something that involves densities after you handle all outliers and inliers in x, of course. How do you do that? Well, that's a topic that can fill a whole book chapter, so I'll have to draw the line here, I'm afraid.
What about when there is a target variable? Let's start with a binary one as it's simpler this way. We can employ a robust similarity metric that can assess the similarity of two binary variables, regardless of their alignment or any similarities due to chance. Fortunately, I've developed one such metric, which I call holistic symmetric similarity (HSS), which also works with all sorts of discrete variables. So, by using this metric, we can optimize the split to maximize the HSS score between the binarized x and the target variable y. The same approach works if y is discrete but not binary since I've generalized HSS to handle nominal variables too.
Ok, but what about when y is continuous, though? Well, that takes a bit more creativity since it's not as simple a task as it may seem. Fortunately, it's doable and relatively light, computationally speaking. We can find the threshold that maximizes a custom correlation metric that becomes larger once any non-linearities are tackled. This process doesn't have to be rocket science since I'm sure you can come up with a metric like that if you've been mentored by someone worth his salt in data science. Of course, you could use a translinearity correlation metric, yet, I wouldn't recommend that since it would inevitably pick up signals you wouldn't want it to, plus it's bound to be more computationally heavy.
So, there you have it. You can binarize and therefore discretize any feature x you like, with or without a target variable. The latter can be binary, discrete, or even continuous, depending on the problem at hand. Such a process can help you preserve computational resources and perhaps even enable you to make better and more transparent models (after all, binary variables tend to be easier on the mind, not just on the computer). All this I've done in the OD.jl script, which I cannot share here, unfortunately, as it has dependencies on proprietary code (the BROOM framework), which I'd rather not give away. Still, if you wish to explore this topic further, we can do that in a one-on-one mentoring session or two, given that you have the required commitment to the craft and a genuine interest to learn more about it. Cheers!
I realize that I’ve done this topic before, but perhaps it needs some more attention, as it’s a very useful topic.
Diagrams are great, but they are also challenging. As for the other graphics (particularly those not generated by a plotting library), these can be tough too. But both diagrams and these unconventional graphics are often essential in our line of work, be it as data scientists or data analysts. Let's examine the hows and whys of all this.
First of all, diagrams and graphics, in general, are a means of conveying information more intuitively. When you look at a table filled with numbers and other kinds of data, you need to think about them, and sometimes you have to know something about the context of all this. With diagrams, you may get an idea of the underlying information even if you don't know much about the context. Of course, the latter can help bring about scope and perspective, helping you interpret the diagram better and make it more applicable to the task at hand.
Diagrams and unconventional graphics are paramount in presentations too. Imagine going to a client or a manager with just a code notebook at your disposal! Even if they may appreciate you having done all this work chances are that you'll need more than that to get them on your side and see the real value behind all these ones and zeros! Besides, the adage of "a picture is a thousand words" is valid, even in Analytics work. Data modelers have figured that out a long time ago, which is why diagrams are their bread and butter. Perhaps there is something to be learned from all this.
But how do you go about creating diagrams and unconventional graphics in general? After all, graphics design is a challenging discipline, and it's not realistic to try to do this kind of work without lots of studying and practicing. Also, it's doubtful we'll ever be as good as graphics professionals who often have the talent to drive their know-how. Still, we can learn some basics and create decent-looking diagrams and graphics, to facilitate our data science endeavors.
For starters, we can invest in learning a program like GIMP. This software is an open-source version of Photoshop, and it's well-established and documented. So, if you have a good image or graphic to work with, GIMP can make it shine. Also, programs like LibreOffice Draw can be practically essential for this sort of work, especially if you want to build something from scratch.
Contrary to what some people think, creating graphics is very detailed work, not some artistic endeavor. You need to use both your analytical and your imaginative faculties for such a task, even if the imagination part may seem dominant, at least in the beginning. So, for any graphics-related tasks, remember, zooming in is your friend! As for the properties box of any graphical object, that's your best friend!
Anyway, I could go on and talk about graphics in data science and data analytics work all day, but it’s not possible to do this topic justice in a single blog post. Besides, the best way to learn is by practicing, just like when it comes to building and refining data models for your Analytics work. Cheers!
Code notebooks have become a necessity for anyone doing any serious work in this field. Although they don't have the same functionality as IDEs, they are instrumental, especially if you want to showcase your code. Also, it's something that has evolved quite a bit over the years. This notebook covered in this article is seemingly the latest development of this evolution, at least for the Julia programming language, in data science work.
Don’t You Mean Jupyter Notebooks?
I understand why someone would think that. After all, Jupyter notebooks are well-established, and I've often used them myself over the years. However, this code notebook I write about is an entirely different animal (you could say another world altogether!). Neptune notebooks have little to do with Jupyter ones, so if you go to them expecting them to be just like Jupyter but better, you'd be disappointed. However, if you see them for what they are, you may be in for a surprise.
A Voyage through the Solar System of Code Notebooks
Neptune notebooks are essentially Julia scripts rendered on a web browser. At the time of this writing, this browser is usually Chrome or some variant of it (e.g., Chromium) and Firefox. However, The latter browser isn't ideal for particular tasks, such as printing (even if it's PDF printing).
A Neptune notebook can run on Julia, even if you don't have the Neptune.jl library installed. However, if you do, you can load the notebook on your browser and have Julia running in the background, just like in the Jupyter notebooks. However, unlike its more established brother, the Neptune notebook only supports Julia, particularly the later versions of the language. Also, the layout is quite different, and at first, it may seem off-putting if you are used to the elegance and refined interface of Jupyter notebooks.
Neptune notebooks are rudimentary, perhaps even minimalist, compared to Jupyter ones. However, because of this, they are far more stable and efficient. In fact, in the few months that I've been using Neptune notebooks, I've never had it crash on me, not once. Also, authentication errors are rare and only happen if you try to run a Neptune notebook on both Firefox and Chromium simultaneously. The notebook seems to lock onto the browser you start it in, usually the default browser. Wrinkles like this will hopefully be ironed out in future versions of the Neptune library. Despite them, however, these notebooks are quite slick and their support of markdown is a noteworthy alternative to the text cells of Jupyter notebooks. Perhaps overall, they are geared towards the more advanced users at the time of this writing. Hopefully, this is something that will change if more data scientists embrace this technology.
By the way, technically Neptune is a form of the Pluto.jl library, which enables Pluto notebooks. However, the latter, although quite interesting, aren't designed for data science work and I'd not recommend them. If, however, you are a Julia programmer who wants to use something different, Pluto is a good alternative. Just don't try to do a data science project on them before getting some insurance policy on your computer, since it's likely you are going to physically break it, out of frustration!
The development of code notebooks is fascinating, and Neptune seems to be a respectable addition to all this. As there aren't any decent tutorials at the moment that I can point you to, I suggest you play around with such a notebook yourself to see if it does it for you. If you want to see one such code notebook in action, you can check out my Anonymization and Pseudonymization course I've published relatively recently on WintellectNow. All the coding in it is on a Neptune notebook, which is the hands-on part of that mini-course. Cheers.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.