Benchmarking is the process of measuring a script's performance in terms of the time it takes to run and the memory it requires. It is an essential part of programming and particularly useful for developing scalable code. Usually, it involves a more detailed analysis of the code, such as profiling, so we know exactly which parts of the script run most often and what proportion of the overall time they take. With this information, we can optimize the script, making it leaner and more efficient.
Benchmarking is great as it allows us to optimize our scripts, but what does this mean for us as data scientists? From a practical perspective, it enables us to work with larger data samples and save time, time we can then use for higher-level thinking and refining our work. Also, being able to develop high-performance code can make us more independent as professionals, something that has numerous advantages, especially when dealing with large-scale projects. Finally, benchmarking allows us to assess the methods we use (e.g. our heuristics) and thereby make better decisions regarding them.
In Julia, in particular, there is a useful package for benchmarking, which I discovered recently through a fellow Julia user. It's called BenchmarkTools and it offers a number of useful functions for accurately measuring the performance of any script (e.g. the @btime and @benchmark macros, which provide essential performance statistics). With these measures as a guide, you can easily improve the performance of a Julia script, making it more scalable. Give it a try when you get the chance.
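For instance, here is a minimal sketch of how these macros can be used (the sum over a random vector is just a stand-in for whatever code you wish to measure):

```julia
using BenchmarkTools

# Some workload we want to measure, e.g. summing a large random vector
data = rand(10^6)

@btime sum($data)        # reports the minimum time and memory allocations
@benchmark sum($data)    # returns a full distribution of timing statistics
```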
Note that benchmarking on its own is not sufficient for improving a script. Unless you take action to change the script, perhaps even rewriting it using a different algorithm, benchmarking can't do much. After all, it is more like an objective function that you try to optimize; how it changes is really up to you! This illustrates that benchmarking is just one part of the whole editing process.
What's more, note that benchmarking needs to be done on scripts that are free of bugs. Otherwise, it wouldn't be possible to assess the script's performance, since it wouldn't run to completion. Still, you can evaluate parts of it independently, something that a functional approach to the program would enable.
Finally, it’s always good to remember this powerful methodology for script optimization. Its value in data science is beyond doubt, plus it can make programming more enjoyable. After all, for those who can appreciate elegance in a script, a piece of code can be a work of art, one that is truly valuable.
A functional language is a programming language based on the functional paradigm of coding, whereby every process of a program is a function. This allows for greater speed and mitigates the risk of bugs, since it's much easier to figure out what's happening in a program when everything in it is modular. In such a program, each module corresponds to a function with its own variable space. Naturally, this helps conserve memory and makes any methods developed this way more scalable.
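As a rough illustration of this style in Julia (the functions here are just toy examples, not part of any particular package):

```julia
using Statistics  # for mean() and std()

# Each step is a small, self-contained function with its own variable space
clean(x) = filter(!isnan, x)                 # drop any NaN entries
standardize(x) = (x .- mean(x)) ./ std(x)    # z-score scaling

# The overall program is just a composition of these functions
process = standardize ∘ clean

process([1.0, 2.0, NaN, 4.0])   # returns the standardized, cleaned vector
```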
Functional languages are very important nowadays, as people are realizing that their advantages make them ideal in many performance-critical cases. Also, in cases where development speed is a factor, functional languages are often preferred. It's important to remember, though, that many people still favor object-oriented programming (OOP) languages, so the latter aren't going to go away any time soon. That's why there are lots of hybrid languages that combine elements of OOP and functional programming.
So far, a couple of functional languages have been particularly relevant in data science projects: Scala (the language Spark was built in) and Julia, with the latter gaining popularity as more and more data science packages become available for it. Since these languages have been shown to provide a performance edge (much like other functional languages), their value in data science is hard to deny, even if many data scientists prefer more traditional languages, such as Python.
What about the future of functional programming? Well, it seems quite promising, especially considering how many new programming languages of this paradigm exist nowadays. The fact that new ones keep appearing goes to show that this way of programming is here to stay. Since the OOP paradigm has its own advantages, it seems quite likely that newer functional languages will be hybrid, to lure practitioners who are already accustomed to (and to some extent invested in) the OOP way of programming. Moreover, functional languages are bound to become more specialized, since there are enough of them now that each needs a niche in order to stand out. In fact, some of them, such as Julia, appear to have done just that.
If you wish to learn more about the Julia functional language and its application to data science, I have authored two books about it through the Technics Publications publishing house. Feel free to check them out here and learn more about this fascinating functional language. Cheers!
What is an API?
In computer science, API is short for Application Programming Interface. It is, in essence, a facilitator that allows an organization (e.g. a company) to share information with its clients and partners over the internet, oftentimes bypassing websites altogether. An API is designed for computer programs, so it's usually developers who deal with this tech, though many data scientists and business people are getting involved in this promising piece of technology as well.
Why are APIs important?
APIs make prototyping a service super-fast, while enabling easier and more scalable leveraging of data. The latter can come from all sorts of sources and systems, since APIs are platform-agnostic. So, if you were to create a mobile app that employs geo-location data, along with various security processes (e.g. for user authentication), you could do this easily using APIs. Also, if you already have a website for handling this sort of information exchange, you can use an API to let your target audience interact with your online system without even having to go to your site (the API becomes a proxy for your site's back-end, enabling them to access it through the app). For these and other reasons, APIs are very important today and an essential part of any data-driven organization.
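To make this a bit more concrete, here is a hedged sketch of how a client program might consume such an API from Julia (the endpoint, the field names, and the authentication scheme are all hypothetical; the HTTP and JSON3 packages are assumed to be installed):

```julia
using HTTP, JSON3

# Hypothetical endpoint of a geo-location service (illustrative URL only)
url = "https://api.example.com/v1/location?user=42"

# In practice, the API key would come from a secure store, not the source code
headers = ["Authorization" => "Bearer YOUR_API_KEY"]

resp = HTTP.get(url, headers)    # the client never visits the website itself
data = JSON3.read(resp.body)     # parse the JSON payload into a Julia object
println(data)
```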
Thoughts on the "API Success" book
So, what about the "API Success" book by Nelson Petracek (Technics Publications)? Well, this book covers the topic from various angles, with a strong focus on the business side of it. It provides lots of examples justifying the value-add of APIs and where they fit within a modern organization. The book is well-written and easy to read, despite the large number of acronyms used in it. Interestingly, the book covers marketing as well, making a strong case for using APIs in a business project, be it as the main product or as part of a package. It even explores how APIs can facilitate partnerships with other organizations and foster long-term business relationships. The author, who is a very hands-on person, has a good sense of humor and writes in a way that's engaging and easy to follow.
The strongest part of the book, in my view, is the collection of architectural and design-related tips and the abundant advice on the life cycle of an API, along with the corresponding diagrams that make all this quite comprehensible. As for shortcomings, the lack of any hands-on material or reference resources is the only one that stands out. Nevertheless, the rest of the book makes up for this through comprehensive coverage of the topic from various angles.
How you can get this book at a 20% discount
Although this book is available in a variety of places, you can get it at a discounted price if you go to the publisher’s site and use the coupon code DSML at the checkout. The book is already reasonably priced (around $30 for the printed version) but why not get it at a lower price? After all, this is a book with evergreen content, something you’d like to refer to again and again, maybe even share with your team when building your own APIs. Check it out!
JuliaCon stands for Julia Conference and it's an annual educational/promotional event that Julia Computing organizes. The latter is the Massachusetts-based company that manages the development and evolution of the programming language. So, JuliaCon is its way of promoting the language and keeping everyone interested in it updated on its recent developments.
JuliaCon is primarily for programmers and members of the scientific community who employ Julia in their work. However, it also appeals to Julia enthusiasts and anyone interested in the language's ecosystem, as well as its numerous applications. It's not targeted at data scientists per se, though lately there have been several sessions at the conference involving Machine Learning and A.I., since lots of people are interested in these areas. Note that most of the people involved in the relevant packages are not professional data scientists, though some of them are familiar with the field and have written (mostly academic) papers about it. So, if you are looking to learn about data science at this conference, you may be disappointed; yet if you just wish to explore what Julia brings to the table when it comes to data science tools, you may be in for a treat.
This year several interesting things were revealed at JuliaCon, which I attended. Namely, the Tuesday workshop on improving Julia code performance and compatibility with other programming languages was truly worth it, as it covered a variety of tweaks that can make a script use less memory and/or run faster. Also, incorporating Python and C code in a Julia script was covered thoroughly, more so than any documentation page could manage.
Unfortunately, some sessions weren't properly synced and were either delayed or altogether missing from the live stream (at least on my Firefox browser). This definitely took away from the whole conference experience. Perhaps if the entire conference had been done on Zoom, it would have been smoother. The Q&A chat in the workshops was a nice touch though, and added a lot to them.
The sessions themselves were pretty good overall, covering a variety of topics, from the more technical to the more application-oriented. They were organized in different tracks, making it easy to find the session you were most interested in. The "Interactive data dashboards with Julia and Stipple" session stood out. Even though it was a fairly short one, it was very relevant to data science work and came with good examples showcasing Stipple's functionality. I'd definitely recommend you watch the recording of it, which should be available by now on the Julia YouTube channel, along with the other sessions of the conference.
JuliaCon usually takes place in either the US or Europe. This year it was Europe's turn to host the conference and it was scheduled to take place in Lisbon, Portugal. Although that laid-back coastal capital would be ideal for such a conference (definitely more accessible than London, where it took place a couple of years ago), this year for the first time it took place online. This was due to the safety measures related to Covid-19, which impacted logistics severely. Anyway, if all goes well, next year it is expected to take place in the US. If you wish to delve more into Julia, feel free to check out my books on the subject. Cheers!
Contrary to what the image suggests, this article is about the Bourne Again Shell, aka BASH. BASH is essential for any serious work on the GNU/Linux OS, which, believe it or not, is by far the most popular OS in the world, considering that the vast majority of servers run on it. Also, Linux-based OSes are popular among data scientists due to the versatility they offer, as well as their reliable performance.
BASH is the way you interact with the computer directly when using a GNU/Linux system. In a way, it is a programming language in its own right, one that operates much closer to the operating system than the languages data scientists typically use. BASH is ideal for ETL processes, as well as any kind of automation. For example, the cron jobs you may set up using crontab are essentially scheduled shell commands.
Of course, nothing beats a good graphical user interface (GUI) when it comes to ease of use, which is why most Linux-based OSes ship with a GUI too (e.g. KDE, Gnome, or Xfce). However, it's the command-line interface (CLI) that's the more powerful of the two. Also, the CLI is more or less the same across all Linux-based OSes and makes use of commands that are almost identical to those of the UNIX system. The programmatic framework behind all this is, naturally, BASH.
BASH is not for the fainthearted, however. Unlike most programming languages used in data science (e.g. Python, Julia, or Scala), BASH is not as easy to work with, as it has an old-school kind of style. Still, if you are proficient in any programming language, BASH doesn't seem so daunting. Also, scripts written in it are very fast and can handle a lot of useful tasks promptly. Yet, even without getting your hands dirty with BASH scripting, you can do a lot using aliases, which are typically defined in the .bash_aliases file.
A cool thing about BASH is that you can call it through other programming environments too. In Julia, for example, you can either drop into the shell directly using the semicolon shortcut (;) in the REPL, or run a shell command using the format run(`command`). Note that for this to work, you need to wrap the command in backticks (`).
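Here is a minimal sketch of what this looks like in practice (the commands themselves are just illustrative):

```julia
# Running shell commands from within Julia
run(`ls -l`)                      # list the current directory's contents
run(`echo "Hello from the shell"`)

# Capturing a command's output as a string instead of just displaying it
files = read(`ls`, String)
println(files)

# Commands can also be pipelined, much like in BASH itself
run(pipeline(`ls`, `wc -l`))      # count the entries in the current directory
```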
Interestingly, a lot of the commands in Julia are inspired by BASH (e.g. pwd() is a direct reference to the pwd BASH command). So, if you are comfortable with Julia, you'll find BASH quite accessible too. Not easy, but accessible.
Tools like BASH may not be easy to master (I'm still working on that myself, after all these years) but they can be useful even if you just know the basics. It doesn't take much expertise to come up with a useful script that speeds up processes you use regularly. For example, the RIP script I created recently can terminate up to three different processes that are giving you a hard time (yes, even in Linux-based OSes a process sometimes gets stuck!). You can also have another script to facilitate backups, or even program updates and upgrades. The sky is the limit!
All this may seem irrelevant to data science and A.I. work but the more comfortable you are with low-level programming like BASH, the easier it is to build high-performance pipelines, bringing your other programs to life. After all, you can’t get closer to the metal than with BASH, without using a screwdriver!
Lately, I've been working with Julia a lot, partly because of my new book, which was released a couple of weeks ago, and partly because it's fun. If you are somewhat experienced with data handling, you're probably familiar with the JLD and JLD2 packages, which are built on top of the FileIO one. Of these, JLD2 is faster and generally better, but it seems that it's no longer maintained, while the files it produces are quite bulky. This may not be an issue with a smaller dataset, but when you are dealing with lots of data, it can be quite time-consuming to load or save data using this package.
Enter the CDF framework, which is short for Compressed Data Format. This is still a work in progress, so don't expect miracles here! However, I made sure it can handle the most common data types, such as scalars of any type (including complex and string variables), the most common arrays (including BitArrays) of up to two dimensions, and of course dictionaries containing the above data structures. It does not yet support tensors or any kind of Boolean array (which, for some bizarre reason, is a distinct data structure from the BitArray one).
CDF has two types of functions within it: those converting data into a large text string (and vice versa) and those dealing with IO operations using binary files. The compression algorithm employed is Deflate, from the CodecZlib package, as it seems to perform better than the other algorithms in that package. Remember that all of these algorithms take a text variable as input, which is why everything is converted into a string before being compressed and stored in a data file.
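To give a rough idea of the underlying mechanism, here is a simplified sketch using CodecZlib directly (this is not the actual CDF code, just an illustration of the string-compress-store cycle):

```julia
using CodecZlib

# Everything is first serialized into one large text string (a toy example here)
text = join(string.(rand(1:100, 10_000)), ",")

# Compress the string with the Deflate algorithm and write it to a binary file
compressed = transcode(DeflateCompressor, Vector{UInt8}(text))
write("data.cdf", compressed)

# Reading it back: decompress the bytes and recover the original string
restored = String(transcode(DeflateDecompressor, read("data.cdf")))
restored == text    # true
```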
The CDF framework doesn't have many of the options JLD2 has, but for the files created with it, the performance edge is quite obvious. A typical dataset can be around 20 times smaller and many times faster to read or write, as the compression algorithm is fairly fast and there is less data to process with the computationally heavy IO commands. As a codebase, CDF is significantly simpler than other packages, while the main functions are easy as pie. In fact, the whole framework is so intuitive that there is not really any need for documentation.
If you are interested in using this framework and/or extending it, feel free to contact me through this blog. Be sure to mention any affiliation(s) of yours too. Cheers!
Although it's fairly easy to compare two continuous variables and assess their similarity, it's not so straight-forward when you perform the same task on categorical variables. Of course, things are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think.
For example, if two variables are aligned (zeros to zeros and ones to ones), that's fine: you can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are inversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity finds them dissimilar, even though there is no doubt that such a pair of variables is related and the first one could be used to predict the second. Enter the Symmetric Jaccard Similarity (SJS), a metric that alleviates this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of two Jaccard similarities: one with the features as they originally are, and one with one of them reversed.
SJS is easy to use and scalable, while its implementation in Julia is quite straightforward. You just need to be comfortable with contingency tables, something that's already an easy task in this language, though you can also code it from scratch without too much of a challenge. Anyway, SJS is a fairly simple metric, and something I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that's not as simple as it may first seem.
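Here is a rough sketch of how SJS can be computed for a pair of binary variables in Julia (a simplified version, without the contingency-table machinery):

```julia
# Jaccard similarity for two binary vectors: shared 1s over the positions where
# at least one of the two vectors has a 1
function jaccard(x::AbstractVector{Bool}, y::AbstractVector{Bool})
    inter = sum(x .& y)
    uni = sum(x .| y)
    return uni == 0 ? 1.0 : inter / uni
end

# Symmetric Jaccard Similarity: the maximum of the similarity as-is and the
# similarity with one of the two variables reversed (its 0s and 1s flipped)
sjs(x, y) = max(jaccard(x, y), jaccard(x, .!y))

a = [true, true, false, false, true]
b = .!a              # the exact reverse of a
jaccard(a, b)        # 0.0 -- plain Jaccard deems them totally dissimilar
sjs(a, b)            # 1.0 -- SJS captures the (reversed) relationship
```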
Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it permutes the values of the first variable until its similarity with the second variable is maximized, something that can be done in a deterministic and scalable manner. However, the algorithm makes it apparent that SJS may fail to reveal the edge that a non-symmetric approach could yield, namely when certain values of the first variable are more similar to a particular value of the second variable. In practical terms, this means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes.
That's why an exhaustive search of all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is also a good one.
Of course, SJS for nominal features may be useful for assessing if one of them is redundant. Just like we apply some correlation metric for a group of continuous features, we can apply SJS for a group of nominal features, eliminating those that are unnecessary, before we start breaking them down into binary ones, something that can make the dataset explode in size in some cases.
All this is something I was working on the other day, as part of another project. In my latest book, "Julia for Machine Learning" (Technics Publications), I talk about such metrics (not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!
The concept of antifragility is well-established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g. in Investopedia). This is a vast concept and it’s unlikely that I can do it justice, especially in a blog post. That’s why I suggest you familiarize yourself with it first before reading the rest of this article.
Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones that manage to extract every drop of signal from the data they are given), there are fragilities all over the place when it comes to how these models are used. A clear example of this is the computer code around them. I'm not referring to the code used to implement them, which usually comes from specialized packages; that code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it the code taking care of ETL work, feature engineering, or even data visualization, may not always be good enough.
Antifragility applies to computer code in various ways. Here are the ones I’ve found so far:
All this may seem like a lot of work and it may not fit your time constraints, particularly if you have strict deadlines. However, you can always improve your code after you've cleared a milestone. This way, you can avoid some Black Swans, like an error being thrown while the program you've made is already in production. Cheers!
For over two decades there has been a puzzle game I've played from time to time, usually to pass the time creatively or to challenge myself in algorithm development. This game, which a friend taught me, didn't have a name and I never managed to find it elsewhere, so I call it Numgame (as it involves numbers and it's a game). Over the years, I managed to solve many of its levels, though I never had an algorithm for it, until now.
The game involves a square grid, originally a 10-by-10 one. The simplest solvable grid is the 5-by-5 one. The object of the game is to fill the grid with numbers, starting from 1 and going all the way to n^2, where n is the size of the grid, which can be any number larger than 4 (grids of size 4 or smaller are not solvable).
To fill the grid, you can "move" horizontally, vertically, or diagonally, as long as the cell you land on is empty. When moving horizontally or vertically, you need to skip 2 squares, while when moving diagonally you need to skip 1. Naturally, as you progress, reaching the remaining empty squares becomes increasingly hard. That's why you need a strategy if you are to finish the game successfully.
Naturally, not all starting positions yield a successful result. Although more often than not you'd start from a corner, you may choose to start from any other square in the grid. That's useful, considering that some grids are just not solvable if you start from a corner (see image below; empty cells are marked as zeros).
Before we look at the solution I've come across, try to solve a grid on your own and think about a potential algorithm to solve any grid. At the very least, you'll gain an appreciation of the solution afterward.
Anyway, the key to solving Numgame levels is to use a heuristic that helps you assess each move. In other words, you need to figure out a score that discerns between good and bad positions, the latter resulting from the various moves. So, for each cell in the grid, you can count how many legitimate ways there are to access it (i.e. ways complying with the aforementioned rules). You can store these numbers in a matrix. Then, you can filter out the cells that are already occupied, since we won't be revisiting them anyway. This leaves us with a list of numbers corresponding to the ways of reaching each of the remaining empty cells.
Then we can take the harmonic mean of these numbers. I chose the harmonic mean because it is very sensitive to small values, which correspond to cells that are becoming hard to reach, exactly the situation we want to avoid. So, the heuristic takes very low values if even a few cells start becoming inaccessible. Also, if a single cell becomes completely inaccessible, the heuristic takes the value 0, the worst possible score. Naturally, we aim to maximize this heuristic as we examine the various positions stemming from all the legitimate moves of each position. By repeating this process, we either end up with a full grid or with one that doesn't progress because it's unsolvable.
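For reference, here is a rough sketch of this heuristic in Julia (my own simplified take, assuming the grid is stored as a matrix with zeros marking empty cells):

```julia
# Moves allowed by the rules: skip 2 cells horizontally/vertically, 1 diagonally
const MOVES = [(3, 0), (-3, 0), (0, 3), (0, -3), (2, 2), (2, -2), (-2, 2), (-2, -2)]

# Number of legitimate ways to reach a given empty cell from other empty cells
function access_count(grid, i, j)
    n = size(grid, 1)
    count(MOVES) do (di, dj)
        ii, jj = i + di, j + dj
        1 <= ii <= n && 1 <= jj <= n && grid[ii, jj] == 0
    end
end

# Heuristic: harmonic mean of the access counts over all empty cells. It drops
# sharply as cells become hard to reach and is 0 if any cell is unreachable.
function heuristic(grid)
    counts = [access_count(grid, i, j) for i in axes(grid, 1), j in axes(grid, 2) if grid[i, j] == 0]
    isempty(counts) && return Inf       # no empty cells left: the grid is solved
    any(==(0), counts) && return 0.0    # some cell can no longer be reached
    return length(counts) / sum(1 ./ counts)
end
```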
The solution to this simple problem may seem obvious now, but it is a good example of how a simple heuristic can crack a problem that's otherwise tough (at least for someone who hasn't tackled it enough to figure out a viable strategy). Naturally, we could brute-force the whole thing, but it's doubtful that this approach would be scalable. After all, in the era of A.I. we are better off seeking intelligent solutions to problems, rather than just throwing computing resources at them!
Lately, I've been busy with preparations for my conference trips, hence my online absence. Nevertheless, I found time to write something for all of you who keep an open mind toward non-hyped data science and A.I.-related content. So, this time I'd like to share a few thoughts on programming for data science, from a somewhat different perspective.
First of all, it doesn't matter that much what language you use, if you have attained mastery of it. Even sub-Julia languages can be useful if you know how to use them well. However, in cases where you use a less powerful language, you need to know about lambda functions. I mastered this programming technique only recently, because in Julia the performance improvement it brings is negligible (unless your original code is inefficient to start with). However, as lambda functions make for more compact scripts, they seem like useful know-how to have. Besides, they have numerous uses in data science, a couple of which are sketched below.
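For instance (toy examples only, using Julia's anonymous function syntax):

```julia
# Anonymous (lambda) functions passed to higher-order functions keep scripts compact
data = [3.2, -1.5, 7.8, 0.0, -4.1]

positives = filter(x -> x > 0, data)    # keep only the positive values
squares   = map(x -> x^2, data)         # transform every element

# Sorting a collection of records by a derived key
people = [("Alice", 34), ("Bob", 28), ("Carol", 45)]
sort(people, by = p -> p[2])            # sort by the second field (age)
```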
Another thing that I've found incredibly useful, and which I mastered in the past few weeks, is the use of auxiliary functions for refactoring complex programs. A large program is bound to be difficult to comprehend and maintain, and that maintenance often falls to someone else, whom you may not have a chance to help out. As comments in your script may also prove insufficient, it's best to break things down into smaller and more versatile functions that are combined in a wrapper function, as in the sketch below. This modular approach, which is quite common in functional programming, makes for more useful code that can be reused elsewhere with minor modifications. Also, it's the first step toward building a versatile programming library (package).
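Here is a rough sketch of what this might look like (the helper functions and the one-number-per-line file format are hypothetical):

```julia
using Statistics

# Small, single-purpose helper functions, each with a clear responsibility
load_values(path) = [parse(Float64, line) for line in eachline(path)]  # assumed format: one number per line
drop_outliers(x) = filter(v -> abs(v - mean(x)) <= 3 * std(x), x)
summarize(x) = (n = length(x), mean = mean(x), sd = std(x))

# The wrapper function simply combines the helpers, making the overall flow obvious
function analyze(path)
    x = load_values(path)
    x = drop_outliers(x)
    return summarize(x)
end
```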
Moreover, I've rediscovered the value of pen and paper in a programming setting. This approach is particularly useful when dealing with problems that are difficult to envision fully. It may seem rudimentary and not something that a "good data scientist" would do, but if you think about it, most programmers also make use of a whiteboard or some other analog writing equipment when designing a solution. It may seem like an excessive task that slows you down, but in the long run it will save you time. I tried this when testing a new graph algorithm I developed for figuring out whether a given graph has cycles in it or not. Since drawing graphs is fairly simple, it was a very useful auxiliary task that made it possible to come up with a working solution to the problem in a matter of minutes.
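For context, the check itself can be sketched along these lines (a standard depth-first approach for undirected graphs given as adjacency lists, rather than my exact algorithm):

```julia
# Determine whether an undirected graph (adjacency lists) contains a cycle,
# using a depth-first search that tracks the parent of each visited node
function has_cycle(adj::Vector{Vector{Int}})
    n = length(adj)
    visited = falses(n)
    function dfs(v, parent)
        visited[v] = true
        for u in adj[v]
            if !visited[u]
                dfs(u, v) && return true
            elseif u != parent
                return true    # a visited neighbor that isn't our parent closes a cycle
            end
        end
        return false
    end
    return any(v -> !visited[v] && dfs(v, 0), 1:n)
end

has_cycle([[2, 3], [1, 3], [1, 2]])   # true  (a triangle)
has_cycle([[2], [1, 3], [2]])         # false (a simple path)
```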
Finally, I was reminded of the usefulness of in-depth pair-coding, particularly for data engineering tasks. Even if one's code is free of errors, there are always things that could use improvement, something that pair-coding can bring to light. Fortunately, with tools like Zoom, this is easier than ever, as you don't need to be in the same physical room to practice this technique. This is something I do with all my data science mentees once they reach a certain level of programming fluency and, according to the feedback I've received, it is what benefits them the most.
Hopefully, all this can help you clarify the role of programming in data science a bit more. After all, you don't need to be a professional coder to make use of a programming language in fields like data science.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.