Lately, I've been busy with preparations for my conference trips, hence my online absence. Nevertheless, I found time to write something for you all who keep an open mind to non-hyped data science and A.I related content. So, this time I'd like to share a few thoughts on programming for data science, from a somewhat different perspective.
First of all, it doesn't matter that much what language you use, if you have attained mastery of it. Even sub-Julia languages can be useful if you know how to use them well. However, in cases where you use a less powerful language, you need to know about lambda functions. I mastered this programming technique only recently because in Julia the performance improvement is negligible (unless your original code is inefficient to start with). However, as they make for more compact scripts, it seems like useful know-how to have. Besides, they have numerous uses in data science, particularly when it comes to:
Another thing that I’ve found incredibly useful, and which I mastered in the past few weeks, is the use of auxiliary functions for refactoring complex programs. A large program is bound to be difficult to comprehend and maintain, something that often falls into the workload of someone else you may not have a chance to help out. As comments in your script may also prove insufficient, it’s best to break things down to smaller and more versatile functions that are combined in your wrapper function. This modular approach, which is quite common in functional programming, makes for more useful code, which can be reused elsewhere, with minor modifications. Also, it’s the first step towards building a versatile programming library (package).
Moreover, I’ve rediscovered the value of pen and paper in a programming setting. Particularly when dealing with problems that are difficult to envision fully, this approach is very useful. It may seem rudimentary and not something that a "good data scientist" would do, but if you think about it, most programmers also make use of a whiteboard or some other analog writing equipment when designing a solution. It may seem like an excessive task that may slow you down, but in the long run, it will save you time. I've tried that for testing a new graph algorithm I've developed for figuring out if a given graph has cycles (cliques) in it or not. Since drawing graphs is fairly simple, it was a very useful auxiliary task that made it possible to come up with a working solution to the problem in a matter of minutes.
Finally, I discovered again the usefulness of in-depth pair-coding, particularly for data engineering tasks. Even if one's code is free of errors, there are always things that could use improvement, something that can be introduced through pair-coding. Fortunately, with tools like Zoom, this is easier than ever before as you don't need to be in the same physical room to perform this programming technique. This is something I do with all my data science mentees, once they reach a certain level of programming fluency and according to the feedback I've received, it is what benefits them the most.
Hopefully, all this can help you clarify the role of programming in data science a bit more. After all, you don't need to be a professional coder to make use of a programming language in fields like data science.
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-Solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is a quite reasonable one and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything of this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, opting to find the best way to raise some revenue for the site all while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Short answer: Nope! Longer answer: clustering can be a simple deterministic problem, given that you figure out the optimal centroids to start with. But isn’t the latter the solution of a stochastic process though? Again, nope. You can meander around the feature space like a gambler, hoping to find some points that can yield a good solution, or you can tackle the whole problem scientifically. To do that, however, you have to forget everything you know about clustering and even basic statistics, since the latter are inherently limited and frankly, somewhat irrelevant to proper clustering.
Finding the optimal clusters is a two-fold problem: 1. you need to figure out which solutions make sense for the data (i.e. a good value for K), and 2. figure out these solutions in a methodical and robust manner. The former has been resolved as a problem and it’s fairly trivial. Vincent Granville talked about it in his blog, many years ago and since he is better at explaining things than I am, I’m not going to bother with that part at all. My solution to it is a bit different but it’s still heuristics-based. The 2nd part of the problem is also the more challenging one since it’s been something many people have been pursuing a while now, without much success (unless you count the super slow method of DBSCAN, with more parameters than letters in its name, as a viable solution).
To find the optimal centroids, you need to take into account two things, the density of each centroid and the distances of each centroid to the other ones. Then you need to combine the two in a single metric, with you need to maximize. Each one of these problems seems fairly trivial, but something that many people don’t realize is that in practice, it’s very very hard, especially if you have multi-dimensional data (where conventional distance metrics fail) and lots of it (making the density calculations a major pain). Fortunately, I found a solution to both of these problems using 1. a new kind of distance metric, that yields a higher U value (this is the heuristic used to evaluate distance metrics in higher dimensional space), though with an inevitable compromise, and 2. a far more efficient way of calculating densities. The aforementioned compromise is that this metric cannot guarantee that the triangular inequality holds, but then again, this is not something you need for clustering anyway. As long as the clustering algo converges, you are fine.
Preliminary results of this new clustering method show that it’s fairly quick (even though it searches through various values of K to find the optimum one) and computationally light. What’s more, it is designed to be fairly scalable, something that I’ll be experimenting with in the weeks to come. The reason for the scalability is that it doesn’t calculate the density of each data point, but of certain regions of the dataset only. Finding these regions is the hardest part, but you only need to do that once, before you start playing around with K values.
Anyway, I’d love to go into detail about the method but the math I use is different to anything you’ve seen and beyond what is considered canon. Then again, some problems need new math to be solved and perhaps clustering is one of them. Whatever the case, this is just one of the numerous applications of this new framework of data analysis, which I call AAF (alternative analytics framework), a project I’ve been working on for more than 10 years now. More on that in the coming months.
Puzzles, especially programming related ones, can be very useful to hone one’s problem-solving skills. This has been known for a while among coders, who often have to come up with their own solutions to problems that lend themselves to analytical solutions. Since data science often involves similar situations, it is useful to have this kind of experiences, albeit in moderation.
Programming puzzles often involve math problems, many of which require a lot of specialized knowledge to solve. This sort of problems, may not be that useful since it’s unlikely that you’ll ever need that knowledge in data science or any other applied field. Number theory, for example, although very interesting, has little to do with hands-on problems like the ones we are asked to solve in a data science setting.
The kind of problems that benefit a data scientist the most are the ones where you need to come up with a clever algorithm as well as do some resource management. It’s easy to think of the computer as a system of infinite resources but that’s not the case. Even in the case of the cloud, where resource limitations are more lenient, you still have to pay for them, so it’s unwise to use them nilly-willy unless you have no other choice.
Fortunately, there are lots of places on the web where you can find some good programming challenges for you. I find that the ones that are not language-specific are the best ones since they focus on the algorithms, rather than the technique.
Solving programming puzzles won’t make you a data scientist for sure, but if you are new to coding or if you can use coding to express your problem-solving creativity, that’s definitely something worth exploring. Always remember though that being able to handle a dataset and build a robust model is always more useful, so budget your time accordingly.
Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle different sizes in these arrays and even fewer can address various aspects of these pieces of data. PCM is one such heuristic, which I’ve come up with, in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough in terms of the distributions of its variables and able to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric and it makes use primarily of the distribution of the data involved as a way to figure out if there is congruency or not. Optionally, it uses the difference between the mean values and the variance too. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are. The higher the value, the more similar they are. Random sampling provides PCM values around 0.9, but more careful sampling can reach values up closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get in respect with the sample size, because of the inevitable loss of information through the process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and examining how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays, after rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. In the case of the mean and variance option being active, the mean and variance of each array are calculated and their absolute differences are taken. Then each one of them is normalized by dividing with the maximum difference possible, for these particular arrays. The largest array is always taken as a reference point.
When calculating the PCM of matrices (which need to have the same number of dimensions), the PCM for each one of their columns is calculated first. Then, an average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an alternative approach to analytics than what Stats has to offer.
Recently I attended JuliaCon 2018, a conference about the Julia language. There people talked about the various cool things the language has to offer and how it benefits the world (not just the scientific world but the other parts of the world too). Yet, as it often happens to open-minded conferences like this one, there are some unusual ideas and insights that float around during the more relaxed parts of the conference. One such thing was the Nim language (formerly known as Nimrod language, a very promising alternative to Julia), since one Julia user spoke very highly of it.
As I’m by no means married to this technology, I always explore alternatives to it, since my commitment is to science, not the tools for it. So, even though Julia was at an all-time high in terms of popularity that week, I found myself investigating the merits of Nim, partly out of curiosity and partly because it seemed like a more powerful language than the tools that dominate the data science scene these days.
I’m still investigating this language but so far I’ve found out various things about it that I believe they are worth sharing. First of all, Nim is like C but friendlier, so it’s basically a high-level language (much like Julia) that exhibits low-level language performance. This high performance stems from the fact that Nim code compiles to C, something unique for a high-level language.
Since I didn’t know about Nim before then, I thought that it was a Julia clone or something, but then I discovered that it was actually older than Julia (about 4 years, to be exact). So, how come few people have heard about it? Well, unlike Julia, Nim doesn’t have a large user community, nor is it backed up by a company. Therefore, progress in its code base is somewhat slower. Also, unlike Julia, it’s still in version 0.x (with x being 18 at the time of this writing). In other words, it’s not considered production ready.
Who cares though? If Nim is as powerful as it is shown to be, it could still be useful in data science and A.I., right? Well, theoretically yes, but I don’t see it happening soon. The reason is three-fold. First of all, there are not many libraries in that language and as data scientists love libraries, it’s hard for the language to be anyone’s favorite. Also, there isn’t a REPL yet, so for a Nim script to run you need to compile it first. Finally, Nim doesn’t integrate with popular IDEs such as Jupyter and Atom, and as data scientists love their IDEs, it’s quite difficult for Nim to win many professionals in our field without IDE integration.
Beyond these reasons, there are several more that make Nim an interesting but not particularly viable option for a data science / A.I. practitioner. Nevertheless, the language holds a lot of promise for various other applications and the fact that it’s been around for so long (esp. considering that it exists without a company to support its development) is quite commendable. What’s more, there is at least one book out there on the language, so there must be a market for it, albeit a quite niche one.
So, should you try Nim? Sure. After all, the latest release of it seems quite stable. Should you use it for data science or A.I. though? Well, unless you are really fond of developing data science / A.I. libraries from scratch, you may want to wait a bit.
It seems like yesterday when I came up with this encryption system, for which I even wrote on this blog about. I never expected to create a video on it, but what better way to share it with the world, at least its core aspects of it. As there is no reason why I'd consider my implementation of this idea the best possible, I leave the viewer to experiment on his/her own on that matter, after I explain each aspect of the method and showcase a couple of examples of it. Anyway, check out the video on Safari when you get the chance and let me know here what you think of it. Enjoy!
Why Articles on Social Media about Programming for Data Science Seem to Be Straight Out of a Time Capsule
Data science related topics sell, no doubt about that. This is great is you are interested in the field and want to learn more about it, especially practical things that can offer you some orientation in the field. Since programming is a key component of data science, it makes sense to pay attention to material along these lines, particularly if you are new to this whole matter.
How the Situation Is Today
Fortunately there is an abundance of articles on this topic, especially on the social media. However, not everyone who writes such articles is up-to-date on this subject since many of these “expert” tech writers are not forward thinking data scientists themselves. Best case scenario, they have spend a few minutes on the web, probably focusing on the results on the first page of a search engine for the bulk of their material. And shocking as it may be, this material may be geared more towards what’s more popular rather on what’s more accurate. Alternatively, they may have relied on what some data science guru once said on the topic, information that may no longer be particularly relevant. Apart from that, the writers who delve into the production of this sort of articles (or infographics in some cases) have their own biases. Probably they took a programming course at university so if a particular programming platform comes up on their “research” they may be more likely to highlight it. After all, this would make them knowledgeable since they have hands-on experience on that platform, even if it’s not that useful to data science any more. What’s more, many people who write about these topics don’t want to take risks with newer things. It’s much safer to mention languages that everyone knows about and which have a large community around them, than mention newer ones that may be despised by the hardcore users of older coding platforms.
Hope for the Future
For better or for worse, an article on the social media has a limited life span. After all, its purpose is mainly to get enough people to click on a particular link where a given site serves ads, so that the people owning the site can get some revenue from said ads. Therefore, if the article is forgotten in a week, its producers won’t lose any sleep over it. Books and subscription-based videos are not like that though. Neither are technical conferences. So, since the new trends are geared more towards this kind of platforms to become well-known, they are not that much hindered by social media misinformation. After all, if a programming language is good, this is something that will eventually show, even if the fan-boys of the more traditional languages would sooner die than change their views on their favorite coding platforms.
What You Can Do
So, instead of getting swayed by this or the other “expert” with X thousand followers (many of whom are probably either bots or bought followers), you can do your own research. Check out what books are out there on the various programming languages and if they hint towards applicability in data science. Check out videos on Safari and other serious educational platforms. Look at what new language conferences are out there and how they cover data science related topics. And most importantly, try some of these languages yourself. This way you’ll have some more reliable data when making a decision on what language is most relevant and most future-proof in our field, rather than blindly believe whatever this or the other “expert” on the social media says.
JuliaRun is Julia’s latest cloud-based version. In my book, Julia for Data Science, I’ve mentioned that there is an online version of the language, called JuliaBox. This version uses Jupyter as its front-end and runs on the cloud. JuliaRun is the next version of JuliaBox, still using Jupyter, but also offering various scalability options. JuliaRun is powered by the Microsoft cloud, aka Azure. However, there is an option of running it on your own cluster (ask the Julia Computing people for details).
Signing in JuliaRun is a fairly simple process. You just need to use either your GitHub credentials or your Google account. It’s not clear why someone has to be tied to an external party instead of having a Julia Computing account, but since creating a Google account is free, it's not a big issue! Also, it is a bit peculiar that JuliaRun doesn’t support Microsoft credentials, but then again, a MS account is not as popular as these other two sign-in options.
After you sign in, you need to accept the Terms of Service, a fairly straight-forward document, considering that it is a legal one. The most useful take-away from it is that if you leave your account inactive for about 4 months, it’s gone, so this is not for people who are not committed to using it.
Once you accept the ToS, you are taken to an IJulia directory, on Jupyter. This is where all your code notebooks are stored. The file system has a few things there already, the most noteworthy of which being a few tutorials. These are very helpful to get you started and also to demonstrate how Julia works in this platform. If you’ve never used IJulia before, there are also a good guide for that. Note that IJulia can run on Jupyter natively too, once you install the IJulia package and the Jupyter platform, on your machine.
Kernel and Functionality
The Julia version being used on JuliaRun is the latest stable release, which at the time of this writing is 0.6. However, the kernel version may differ for certain notebooks (e.g. for the JuliaTutorial one, it’s 0.5.2). Still, the differences between the last couple of versions are minute, for the most part. I’d recommend you go through the tutorials and also create some of your own test notebooks, before starting on a serious project, unless of course you use IJulia already on your computer.
Adding packages is fairly straight-forward, though it can be time-consuming as a process, especially if you have a lot of packages to install. Also, you have the option of installing a package in either one of the two latest versions of the language, or both, if you prefer. If you are more adventurous, you can even installed an unregistered package, by providing the corresponding URL.
You can also add code to JuliaRun through a Git repository (not necessarily GitHub). You just need to specify the URL of the repository, the branch, and which folder on JuliaBox you want to clone it in.
JuliaRun also offers a brief, but useful, help option. It mainly consists of a few FAQs, as well as an email address for more specialized questions. This is probably better than the long help pages in some other platforms that are next to impossible to navigate and are written by people who are terribly at writing. The help on this platform is brief, but comprehensive and with the user in mind.
For those who are closer to the metal and prefer the direct interaction with the Julia kernel, rather than the IJulia notebook interface, there is also the option to start a terminal. You can access that via the New button at the directory page.
From what I’ve seen of JuliaRun, both through a demo from the Julia team, and through my own experience, it is fairly easy to use. What I found very useful is that it doesn’t require any low-level data engineering expertise, though if you are good at working the processes of a cloud platform through the terminal or via Jupyter, that’s definitely useful. However, if you are someone more geared towards the high-level aspects of the craft, you can still do what you need to do, without spending too much time on the configurations.
I’d love to write more about this great platform that takes Julia to the next level, but this post is already too long. So, whenever you have a chance, give it a try and draw your own conclusions about this immensely useful tool.
People like to argue, especially about things they can reason with. However, just because you can justify that your view has merit, giving some practical examples or through logical reasoning, this doesn't make alternative views invalid. If there are several programming languages in data science, perhaps an oversimplification like “X is the best language for data science because Y” doesn't hold much water. Let’s examine why.
Although it is possible to rule out certain languages (e.g. Assembly or C) as optimal for data science, this doesn't mean that the problem has a clear-cut solution. Also, the assumption that a single programming language can cover all the use cases of a data science professional is a quite unjustifiable one. Some data scientists use two or three programming languages, sometimes in combination, getting the best of each, for optimal overall performance.
Also, data science is all about solving a business problem in a scientific manner. Just because say Dr. Smith prefers to use language X over Y, it doesn't mean that you have to follow her example. Maybe she has used language X during her PhD and didn't have time to learn another language, or she attained mastery of that language, so she feels more comfortable doing her data science work with that. She may be a successful data scientist but following her programming habits won’t make you a great data scientist necessarily.
Moreover, with new languages and new packages in the existing languages coming about all the time, which language is best is like the best performing basketball team. Definitely not something particularly stable! Besides, it’s often the case that a particular project may requite special handling, so what is a top-performer now, may not be the best option for that particular case.
In addition, the almost religious attitude towards programming languages that many people have (not just data scientists) is by itself problematic. If a potential employer sees you arguing about how your language of choice is the best and that you are not open to consider alternatives, he may not be so eager to hire you, since this kind of attitude creates disharmony and difficulty in collaboration among the members of a team. Besides, in most companies nowadays, they rarely ask for a specific language in the candidate requirements. As long as you can do the task that’s required of you, they don’t really care much what your programming background is. Of course companies that have already invested in a particular language and have all their code in that language may not be so flexible, but that shouldn't be the principle factor in your decision about which language you learn.
Finally, when it comes to deep learning, many modern frameworks, like Apache’s MXNet, have APIs for a variety of programming language. So if your A.I. guru friend tries to convince you that you should learn language X because that’s the best deep learning language, take that suggestion with a pinch of salt!
The important thing is for whatever language you decide to learn for data science, you make sure that you learn it well. Familiarize yourself with its packages, use it to solve various problems, and learn the best strategies for debugging code written in that language. If you do that, you can still make good use of it for your data science projects, even if the majority of people prefer this or the other language instead.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.