Short answer: Nope! Longer answer: clustering can be a simple deterministic problem, given that you figure out the optimal centroids to start with. But isn’t the latter the solution of a stochastic process though? Again, nope. You can meander around the feature space like a gambler, hoping to find some points that can yield a good solution, or you can tackle the whole problem scientifically. To do that, however, you have to forget everything you know about clustering and even basic statistics, since the latter are inherently limited and frankly, somewhat irrelevant to proper clustering.
Finding the optimal clusters is a two-fold problem: 1. you need to figure out which solutions make sense for the data (i.e. a good value for K), and 2. figure out these solutions in a methodical and robust manner. The former has been resolved as a problem and it’s fairly trivial. Vincent Granville talked about it in his blog, many years ago and since he is better at explaining things than I am, I’m not going to bother with that part at all. My solution to it is a bit different but it’s still heuristics-based. The 2nd part of the problem is also the more challenging one since it’s been something many people have been pursuing a while now, without much success (unless you count the super slow method of DBSCAN, with more parameters than letters in its name, as a viable solution).
To find the optimal centroids, you need to take into account two things: the density of each centroid and the distances of each centroid to the other ones. Then you need to combine the two in a single metric, which you need to maximize. Each one of these problems seems fairly trivial, but something that many people don’t realize is that in practice, it’s very very hard, especially if you have multi-dimensional data (where conventional distance metrics fail) and lots of it (making the density calculations a major pain). Fortunately, I found a solution to both of these problems using 1. a new kind of distance metric, that yields a higher U value (this is the heuristic used to evaluate distance metrics in higher dimensional space), though with an inevitable compromise, and 2. a far more efficient way of calculating densities. The aforementioned compromise is that this metric cannot guarantee that the triangle inequality holds, but then again, this is not something you need for clustering anyway. As long as the clustering algo converges, you are fine.
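To make the "density plus distance" idea concrete, here is a minimal sketch, not the method described in this post. It assumes plain Euclidean distance (the post's own metric is not given), a fixed-radius neighbor count as the density, and a product of density and separation as the single score to maximize. The names local_density, pick_centroids and radius are all hypothetical.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; a stand-in for the post's undisclosed metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_density(points, radius):
    """Density of each point = number of other points within `radius` of it."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and euclidean(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def pick_centroids(points, k, radius):
    """Greedily pick k points, each maximizing (density + 1) * separation."""
    density = local_density(points, radius)
    # Start from the densest point (the first one found, in case of ties).
    chosen = [max(range(len(points)), key=lambda i: density[i])]
    while len(chosen) < k:
        def score(i):
            # Separation = distance to the nearest centroid chosen so far.
            separation = min(euclidean(points[i], points[c]) for c in chosen)
            # The +1 keeps isolated points comparable instead of scoring zero.
            return (density[i] + 1) * separation

        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(max(candidates, key=score))
    return [points[i] for i in chosen]


# Two tight blobs plus one far-away outlier.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (30, 0)]
centroids = pick_centroids(data, k=2, radius=1.5)
print(centroids)
```

In this toy case the outlier is far from everything but has no neighbors, so the combined score favors a point in the second blob instead. The per-point density loop is quadratic in the number of points, which is exactly the cost the post says becomes painful on large datasets.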
Preliminary results of this new clustering method show that it’s fairly quick (even though it searches through various values of K to find the optimum one) and computationally light. What’s more, it is designed to be fairly scalable, something that I’ll be experimenting with in the weeks to come. The reason for the scalability is that it doesn’t calculate the density of each data point, but of certain regions of the dataset only. Finding these regions is the hardest part, but you only need to do that once, before you start playing around with K values.
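One way to picture "density of regions rather than of each data point" is a coarse grid, where every occupied cell is a region and its density is simply its point count. This is only an illustrative assumption. The post does not say how its regions are found, and cell_size, region_densities and dense_regions are hypothetical names.

```python
from collections import Counter


def region_densities(points, cell_size):
    """Map each point to a grid cell and count the points per cell."""
    return Counter(
        tuple(int(coord // cell_size) for coord in point) for point in points
    )


def dense_regions(points, cell_size, min_count):
    """Keep only the cells holding at least `min_count` points."""
    counts = region_densities(points, cell_size)
    return {cell: n for cell, n in counts.items() if n >= min_count}


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (30, 0)]
regions = dense_regions(data, cell_size=5, min_count=2)
print(regions)
```

The grid is built in a single pass over the data, and the resulting cell counts can be reused for every value of K that is tried afterwards.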
Anyway, I’d love to go into detail about the method but the math I use is different to anything you’ve seen and beyond what is considered canon. Then again, some problems need new math to be solved and perhaps clustering is one of them. Whatever the case, this is just one of the numerous applications of this new framework of data analysis, which I call AAF (alternative analytics framework), a project I’ve been working on for more than 10 years now. More on that in the coming months.
Puzzles, especially programming-related ones, can be very useful for honing one’s problem-solving skills. This has been known for a while among coders, who often have to come up with their own solutions to problems that lend themselves to analytical solutions. Since data science often involves similar situations, it is useful to have this kind of experience, albeit in moderation.
Programming puzzles often involve math problems, many of which require a lot of specialized knowledge to solve. These sorts of problems may not be that useful, since it’s unlikely that you’ll ever need that knowledge in data science or any other applied field. Number theory, for example, although very interesting, has little to do with hands-on problems like the ones we are asked to solve in a data science setting.
The kind of problems that benefit a data scientist the most are the ones where you need to come up with a clever algorithm as well as do some resource management. It’s easy to think of the computer as a system of infinite resources but that’s not the case. Even in the case of the cloud, where resource limitations are more lenient, you still have to pay for them, so it’s unwise to use them willy-nilly unless you have no other choice.
Fortunately, there are lots of places on the web where you can find some good programming challenges for you. I find that the ones that are not language-specific are the best ones since they focus on the algorithms, rather than the technique.
Solving programming puzzles won’t make you a data scientist for sure, but if you are new to coding or if you can use coding to express your problem-solving creativity, that’s definitely something worth exploring. Always remember though that being able to handle a dataset and build a robust model is always more useful, so budget your time accordingly.
Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle different sizes in these arrays and even fewer can address various aspects of these pieces of data. PCM is one such heuristic, which I’ve come up with, in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough in terms of the distributions of its variables and able to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric and it makes use primarily of the distribution of the data involved as a way to figure out if there is congruency or not. Optionally, it uses the difference between the mean values and the variance too. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are. The higher the value, the more similar they are. Random sampling provides PCM values around 0.9, but more careful sampling can reach values closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get with respect to the sample size, because of the inevitable loss of information through the process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and examining how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays, after rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. In the case of the mean and variance option being active, the mean and variance of each array are calculated and their absolute differences are taken. Then each one of them is normalized by dividing by the maximum difference possible, for these particular arrays. The largest array is always taken as a reference point.
When calculating the PCM of matrices (which need to have the same number of dimensions), the PCM for each one of their columns is calculated first. Then, an average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
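The binning steps described above can be sketched roughly as follows. This is not the actual PCM implementation, and several details are assumptions: Sturges' rule stands in for the unspecified "optimum number of bins", the bin edges come from the range of the larger array, the normalization is half the summed absolute difference of relative frequencies, and the optional mean and variance terms are left out. All function names are hypothetical.

```python
import math


def sturges_bins(n):
    """Assumed 'optimum' bin count (Sturges' rule); the post does not name its rule."""
    return max(1, math.ceil(math.log2(n)) + 1)


def relative_freqs(values, lo, hi, n_bins):
    """Share of values in each equal-width bin on [lo, hi]; values outside are clipped."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def pcm_like(x, y):
    """Distribution similarity of two vectors in [0, 1]; 1 means identical frequencies."""
    reference = x if len(x) >= len(y) else y  # the larger array sets the bin range
    lo, hi = min(reference), max(reference)
    bx, by = sturges_bins(len(x)), sturges_bins(len(y))
    n_bins = max(1, round(2 * bx * by / (bx + by)))  # rounded harmonic mean
    fx = relative_freqs(x, lo, hi, n_bins)
    fy = relative_freqs(y, lo, hi, n_bins)
    # The summed absolute difference of two frequency vectors is at most 2.
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(fx, fy))


def pcm_like_matrix(X, Y):
    """Average the per-column scores; X and Y are lists of rows with equal column counts."""
    per_column = [
        pcm_like([row[j] for row in X], [row[j] for row in Y])
        for j in range(len(X[0]))
    ]
    return sum(per_column) / len(per_column), per_column


population = list(range(100))
sample = population[::2]  # a systematic sample of every second value
print(pcm_like(population, sample))
print(pcm_like(list(range(10)), list(range(100, 110))))
```

Because the two arrays are compared through relative frequencies, they do not need to have the same length, which is the property that makes this kind of score usable for judging samples against the full dataset.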
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an alternative approach to analytics than what Stats has to offer.
Recently I attended JuliaCon 2018, a conference about the Julia language. There people talked about the various cool things the language has to offer and how it benefits the world (not just the scientific world but the other parts of the world too). Yet, as often happens at open-minded conferences like this one, there are some unusual ideas and insights that float around during the more relaxed parts of the conference. One such thing was the Nim language (formerly known as Nimrod language, a very promising alternative to Julia), since one Julia user spoke very highly of it.
As I’m by no means married to this technology, I always explore alternatives to it, since my commitment is to science, not the tools for it. So, even though Julia was at an all-time high in terms of popularity that week, I found myself investigating the merits of Nim, partly out of curiosity and partly because it seemed like a more powerful language than the tools that dominate the data science scene these days.
I’m still investigating this language but so far I’ve found out various things about it that I believe are worth sharing. First of all, Nim is like C but friendlier, so it’s basically a high-level language (much like Julia) that exhibits low-level language performance. This high performance stems from the fact that Nim code compiles to C, something unusual for a high-level language.
Since I didn’t know about Nim before then, I thought that it was a Julia clone or something, but then I discovered that it was actually older than Julia (about 4 years, to be exact). So, how come few people have heard about it? Well, unlike Julia, Nim doesn’t have a large user community, nor is it backed up by a company. Therefore, progress in its code base is somewhat slower. Also, unlike Julia, it’s still in version 0.x (with x being 18 at the time of this writing). In other words, it’s not considered production ready.
Who cares though? If Nim is as powerful as it is shown to be, it could still be useful in data science and A.I., right? Well, theoretically yes, but I don’t see it happening soon. The reason is three-fold. First of all, there are not many libraries in that language and as data scientists love libraries, it’s hard for the language to be anyone’s favorite. Also, there isn’t a REPL yet, so for a Nim script to run you need to compile it first. Finally, Nim doesn’t integrate with popular IDEs such as Jupyter and Atom, and as data scientists love their IDEs, it’s quite difficult for Nim to win many professionals in our field without IDE integration.
Beyond these reasons, there are several more that make Nim an interesting but not particularly viable option for a data science / A.I. practitioner. Nevertheless, the language holds a lot of promise for various other applications and the fact that it’s been around for so long (esp. considering that it exists without a company to support its development) is quite commendable. What’s more, there is at least one book out there on the language, so there must be a market for it, albeit a quite niche one.
So, should you try Nim? Sure. After all, the latest release of it seems quite stable. Should you use it for data science or A.I. though? Well, unless you are really fond of developing data science / A.I. libraries from scratch, you may want to wait a bit.
It seems like yesterday when I came up with this encryption system, which I even wrote about on this blog. I never expected to create a video on it, but what better way to share it with the world, at least its core aspects. As there is no reason why I'd consider my implementation of this idea the best possible, I leave the viewer to experiment on his/her own on that matter, after I explain each aspect of the method and showcase a couple of examples of it. Anyway, check out the video on Safari when you get the chance and let me know here what you think of it. Enjoy!
Why Articles on Social Media about Programming for Data Science Seem to Be Straight Out of a Time Capsule
Data science related topics sell, no doubt about that. This is great if you are interested in the field and want to learn more about it, especially practical things that can offer you some orientation in the field. Since programming is a key component of data science, it makes sense to pay attention to material along these lines, particularly if you are new to this whole matter.
How the Situation Is Today
Fortunately, there is an abundance of articles on this topic, especially on the social media. However, not everyone who writes such articles is up-to-date on this subject since many of these “expert” tech writers are not forward-thinking data scientists themselves. Best case scenario, they have spent a few minutes on the web, probably focusing on the results on the first page of a search engine for the bulk of their material. And shocking as it may be, this material may be geared more towards what’s more popular rather than on what’s more accurate. Alternatively, they may have relied on what some data science guru once said on the topic, information that may no longer be particularly relevant. Apart from that, the writers who delve into the production of this sort of articles (or infographics in some cases) have their own biases. Probably they took a programming course at university so if a particular programming platform comes up in their “research” they may be more likely to highlight it. After all, this would make them knowledgeable since they have hands-on experience on that platform, even if it’s not that useful to data science any more. What’s more, many people who write about these topics don’t want to take risks with newer things. It’s much safer to mention languages that everyone knows about and which have a large community around them, than mention newer ones that may be despised by the hardcore users of older coding platforms.
Hope for the Future
For better or for worse, an article on the social media has a limited life span. After all, its purpose is mainly to get enough people to click on a particular link where a given site serves ads, so that the people owning the site can get some revenue from said ads. Therefore, if the article is forgotten in a week, its producers won’t lose any sleep over it. Books and subscription-based videos are not like that though. Neither are technical conferences. So, since the new trends rely more on these kinds of platforms to become well-known, they are not that much hindered by social media misinformation. After all, if a programming language is good, this is something that will eventually show, even if the fan-boys of the more traditional languages would sooner die than change their views on their favorite coding platforms.
What You Can Do
So, instead of getting swayed by this or the other “expert” with X thousand followers (many of whom are probably either bots or bought followers), you can do your own research. Check out what books are out there on the various programming languages and if they hint towards applicability in data science. Check out videos on Safari and other serious educational platforms. Look at what new language conferences are out there and how they cover data science related topics. And most importantly, try some of these languages yourself. This way you’ll have some more reliable data when making a decision on what language is most relevant and most future-proof in our field, rather than blindly believe whatever this or the other “expert” on the social media says.
JuliaRun is Julia’s latest cloud-based version. In my book, Julia for Data Science, I’ve mentioned that there is an online version of the language, called JuliaBox. This version uses Jupyter as its front-end and runs on the cloud. JuliaRun is the next version of JuliaBox, still using Jupyter, but also offering various scalability options. JuliaRun is powered by the Microsoft cloud, aka Azure. However, there is an option of running it on your own cluster (ask the Julia Computing people for details).
Signing in to JuliaRun is a fairly simple process. You just need to use either your GitHub credentials or your Google account. It’s not clear why someone has to be tied to an external party instead of having a Julia Computing account, but since creating a Google account is free, it's not a big issue! Also, it is a bit peculiar that JuliaRun doesn’t support Microsoft credentials, but then again, a MS account is not as popular as these other two sign-in options.
After you sign in, you need to accept the Terms of Service, a fairly straight-forward document, considering that it is a legal one. The most useful take-away from it is that if you leave your account inactive for about 4 months, it’s gone, so this is not for people who are not committed to using it.
Once you accept the ToS, you are taken to an IJulia directory, on Jupyter. This is where all your code notebooks are stored. The file system has a few things there already, the most noteworthy of which are a few tutorials. These are very helpful to get you started and also to demonstrate how Julia works in this platform. If you’ve never used IJulia before, there is also a good guide for that. Note that IJulia can run on Jupyter natively too, once you install the IJulia package and the Jupyter platform on your machine.
Kernel and Functionality
The Julia version being used on JuliaRun is the latest stable release, which at the time of this writing is 0.6. However, the kernel version may differ for certain notebooks (e.g. for the JuliaTutorial one, it’s 0.5.2). Still, the differences between the last couple of versions are minute, for the most part. I’d recommend you go through the tutorials and also create some of your own test notebooks, before starting on a serious project, unless of course you use IJulia already on your computer.
Adding packages is fairly straight-forward, though it can be time-consuming as a process, especially if you have a lot of packages to install. Also, you have the option of installing a package in either one of the two latest versions of the language, or both, if you prefer. If you are more adventurous, you can even install an unregistered package, by providing the corresponding URL.
You can also add code to JuliaRun through a Git repository (not necessarily GitHub). You just need to specify the URL of the repository, the branch, and which folder on JuliaBox you want to clone it in.
JuliaRun also offers a brief, but useful, help option. It mainly consists of a few FAQs, as well as an email address for more specialized questions. This is probably better than the long help pages in some other platforms that are next to impossible to navigate and are written by people who are terrible at writing. The help on this platform is brief, but comprehensive and with the user in mind.
For those who are closer to the metal and prefer the direct interaction with the Julia kernel, rather than the IJulia notebook interface, there is also the option to start a terminal. You can access that via the New button at the directory page.
From what I’ve seen of JuliaRun, both through a demo from the Julia team, and through my own experience, it is fairly easy to use. What I found very useful is that it doesn’t require any low-level data engineering expertise, though if you are good at working the processes of a cloud platform through the terminal or via Jupyter, that’s definitely useful. However, if you are someone more geared towards the high-level aspects of the craft, you can still do what you need to do, without spending too much time on the configurations.
I’d love to write more about this great platform that takes Julia to the next level, but this post is already too long. So, whenever you have a chance, give it a try and draw your own conclusions about this immensely useful tool.
People like to argue, especially about things they can reason with. However, just because you can justify that your view has merit, giving some practical examples or through logical reasoning, this doesn't make alternative views invalid. If there are several programming languages in data science, perhaps an oversimplification like “X is the best language for data science because Y” doesn't hold much water. Let’s examine why.
Although it is possible to rule out certain languages (e.g. Assembly or C) as optimal for data science, this doesn't mean that the problem has a clear-cut solution. Also, the assumption that a single programming language can cover all the use cases of a data science professional is a quite unjustifiable one. Some data scientists use two or three programming languages, sometimes in combination, getting the best of each, for optimal overall performance.
Also, data science is all about solving a business problem in a scientific manner. Just because say Dr. Smith prefers to use language X over Y, it doesn't mean that you have to follow her example. Maybe she has used language X during her PhD and didn't have time to learn another language, or she attained mastery of that language, so she feels more comfortable doing her data science work with that. She may be a successful data scientist but following her programming habits won’t make you a great data scientist necessarily.
Moreover, with new languages and new packages in the existing languages coming about all the time, which language is best is like the best performing basketball team. Definitely not something particularly stable! Besides, it’s often the case that a particular project may require special handling, so what is a top-performer now, may not be the best option for that particular case.
In addition, the almost religious attitude towards programming languages that many people have (not just data scientists) is by itself problematic. If a potential employer sees you arguing about how your language of choice is the best and that you are not open to consider alternatives, he may not be so eager to hire you, since this kind of attitude creates disharmony and difficulty in collaboration among the members of a team. Besides, in most companies nowadays, they rarely ask for a specific language in the candidate requirements. As long as you can do the task that’s required of you, they don’t really care much what your programming background is. Of course companies that have already invested in a particular language and have all their code in that language may not be so flexible, but that shouldn't be the principal factor in your decision about which language you learn.
Finally, when it comes to deep learning, many modern frameworks, like Apache’s MXNet, have APIs for a variety of programming languages. So if your A.I. guru friend tries to convince you that you should learn language X because that’s the best deep learning language, take that suggestion with a pinch of salt!
The important thing is that, whatever language you decide to learn for data science, you make sure that you learn it well. Familiarize yourself with its packages, use it to solve various problems, and learn the best strategies for debugging code written in that language. If you do that, you can still make good use of it for your data science projects, even if the majority of people prefer this or the other language instead.
People nowadays, especially those who don’t understand programming, tend to be opinionated about programming languages and harbor unrealistic expectations. It’s this kind of people who spill negativity towards promising projects like Julia, which are still in the process of development. The same people would probably say nasty things about Python, or R, if these languages were developed in a time when early releases of them were accessible to the world through the Internet. So, perhaps it’s not really Julia these people have an issue with...
It’s easy to criticize something, be it a book, a movie, or a programming language. It’s probably the easiest thing someone can do, other than doing nothing. However, doing nothing doesn't hurt anyone, while the negativity of criticism has a corrosive effect on whoever is exposed to it. It would be overly idealistic to think that people who have this nasty habit could be cured of it, since most likely there are deep issues that cause it to manifest, which would probably require professional help to remedy. What can be remedied fairly easily though is the effect of these criticisms, since they are based on some shallow opinion rather than facts.
So, if you have heard someone who has spent a few hours learning about Julia and trying it out on his laptop dis Julia, that’s not a view you need to take very seriously. Just like every programming language, Julia has its issues and the packages out there are not in their final form. Just because something doesn't have the maturity and elegance of Pandas or Scikit-learn, it doesn't make it useless though. Julia, unlike other high-level languages, enables its users to make their own scripts easily and ensure high performance in them. Imagine trying to do that in Python! You’d need to be a computer science expert in order to guarantee high performance in a script you just put together and most likely you’d need to make use of C at some point (Cython).
However, just because some people love Julia and swear by it, you shouldn't take their word for it. The idea is that you try it out yourself, like you’d try some other language, namely through methodical studying and practice. After you've spent quite some time and have developed your own (working) programs in it, then you can have a valid opinion on it. And if you don’t like it, that’s fine. Most Julia users don’t take offense if you don’t like their favorite language. However, since these people don’t dis your language of choice, I believe it is only fair if you show some respect for their favorite language. After all, Julia is not competing with any other language. It just does its thing, like Swift, and other fairly new programming languages.
Perhaps Julia is not the language of choice for the majority of data science practitioners. That’s perfectly fine. Just because it’s not as mature as Python or R, however, it doesn't mean that it’s not useful. Also, as it’s still in its early stages of development, it can only improve as time goes by. Till then, you can always use it for specific tasks, parallel to your language of choice. After all, there are bridge packages that enable that, which is more than someone could say about some other new languages, like Go.
If I've tried to make the argument that Julia is a great programming language, that’s because I find new technologies interesting and useful for an ever-changing field, such as data science. It was never my intention to convert anyone to that language, merely make it more well-known. After all, data science is all about mindset and methodologies, not so much about the specific tools, which inevitably change over time.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.