What is an API?
In computer science, API is short for Application Programming Interface. In essence, it is a facilitator for an organization (e.g. a company) to share information with its clients and partners over the internet, oftentimes bypassing websites. An API is designed for computer programs, so it’s usually developers who deal with this tech, though many data scientists and business people are getting involved in this promising piece of technology.
Why are APIs important?
APIs make prototyping a service super-fast, and they enable easier and more scalable leveraging of data. The latter can come from all sorts of sources and systems since APIs are platform-agnostic. So, if you were to create a mobile app that employs geo-location data, along with various security processes (e.g. for user authentication), you can do this easily using APIs. Also, if you already have a website for handling this sort of information exchange, you can use an API to let your target audience interact with your online system without even having to go to your site (the API becomes a proxy for the back-end of your site, enabling them to access it through the app). For these and other reasons, APIs are very important today and an essential part of any data-driven organization.
Thoughts on the "API Success" book
So, what about the "API success" book by Nelson Petracek (Technics Publications)? Well, this book covers the topic from various angles, with a strong focus on the business side of it. It provides lots of examples justifying the value-add of APIs and where they fit in a modern organization. The book is well-written and easy to read, despite the large number of acronyms used in it. Interestingly, the book covers marketing as well, making a strong case for using APIs in a business project, be it as the main product or part of a package. It even explores how APIs can facilitate partnerships with other organizations and the fostering of long-term business relationships. The author, who is a very hands-on person, has a good sense of humor and writes in a way that's engaging and easy to follow.
The strongest part of the book, in my view, is the various architectural and design-related tips and lots of advice on the life cycle of an API, along with the corresponding diagrams that make this quite comprehensible. As for shortcomings, the lack of any hands-on material or reference resources is the only one that stands out. Nevertheless, the rest of the book makes up for this, through comprehensive coverage of the topic from various angles.
How you can get this book at a 20% discount
Although this book is available in a variety of places, you can get it at a discounted price if you go to the publisher’s site and use the coupon code DSML at the checkout. The book is already reasonably priced (around $30 for the printed version) but why not get it at a lower price? After all, this is a book with evergreen content, something you’d like to refer to again and again, maybe even share with your team when building your own APIs. Check it out!
JuliaCon stands for Julia Conference and it’s an annual educational/promotional event that Julia Computing organizes. The latter is the Massachusetts-based company that manages the development and evolution of the programming language. So, JuliaCon is its way of promoting the language and keeping everyone interested in it updated on its recent developments.
JuliaCon is primarily for programmers and members of the scientific community employing Julia in their work. However, it also appeals to Julia enthusiasts and anyone interested in the ecosystem of the language, as well as its numerous applications. It’s not targeted at data scientists per se though lately there are several sessions in the conference that involve Machine Learning and A.I. since lots of people are interested in these areas. Note that most of the people involved in these packages are not professional data scientists, though some of them are familiar with the field and have written papers about it (mostly academic papers). So, if you are looking to learn about data science in this conference you may be disappointed, yet if you just wish to explore what Julia brings to the table when it comes to data science tools, you may be in for a treat.
This year several interesting things were revealed at JuliaCon, which I attended. Namely, the Tuesday workshop on improving Julia code performance and compatibility with other programming languages was truly worth it, as it covered a variety of tweaks that can make a script use less memory and/or work faster. Also, incorporating Python and C code in a Julia script was covered thoroughly, more than any documentation page could.
Unfortunately, some sessions weren’t properly synced and were either delayed or altogether missing from the live stream (at least on my Firefox browser). This definitely took away from the whole conference experience. Perhaps if the whole conference was done on Zoom, it would have been a smoother experience. The Q&A chat in the workshops was a nice touch though and added a lot to them.
The sessions themselves were pretty good overall, covering a variety of topics, from the more technical to the more application-oriented. They were organized in different tracks, making it easy to find the session you were most interested in. The Interactive data dashboards with Julia and Stipple session stood out. Even though it was a fairly short one, it was very relevant to data science work and with good examples, showcasing its functionality. I’d definitely recommend you watch the recording of it, which should be available by now at the Julia YouTube channel, along with the other sessions of the conference.
JuliaCon usually takes place in either the US or Europe. This year it was Europe’s turn to host the conference and it was scheduled to take place in Lisbon, Portugal. Although that laid-back Mediterranean capital would be ideal for such a conference (definitely more accessible than London, where it took place a couple of years ago), this year for the first time it took place online. This was due to the safety measures related to Covid-19 that impacted logistics severely. Anyway, if all goes well, it's expected that next year it will take place in the US. If you wish to delve more into Julia feel free to check out my books on the subject. Cheers!
Contrary to what the image suggests, this article is about the Bourne Again SHell, aka BASH. BASH is essential in any serious work on the GNU/Linux OS, which, believe it or not, is by far the most popular OS in the world, considering that the vast majority of servers run on it. Also, Linux-based OSes are popular among data scientists due to the versatility they offer, as well as their reliable performance.
BASH is the way you interact with the computer directly when using a GNU/Linux system. In a way it is its own programming language, one that is on par with all the low-level languages out there, such as C. BASH is ideal for ETL processes, as well as any kind of automation. For example, the Cron jobs you may set up using Crontab are based on BASH.
Of course, nothing beats a good graphical user interface (GUI) when it comes to ease-of-use, which is why most Linux-based OSes make use of a GUI too (e.g. KDE, Gnome, or Xfce). However, it's the command-line interface (CLI) that's the more powerful of the two. Also, the CLI is more or less the same in all Linux-based OSes and makes use of commands that are almost identical to those on the UNIX system. The programmatic framework of all this is naturally the BASH.
BASH is not for the fainthearted, however. Unlike most programming languages used in data science (e.g. Python, Julia, Scala, etc.), BASH is not as easy to work with as it has an old-school kind of style. However, if you are proficient in any programming language, BASH doesn't seem so daunting. Also, scripts that are written in it are super fast and can do a lot of useful tasks promptly. Yet, even without getting your hands dirty with scripting in BASH, you can do a lot of stuff using aliases, which are defined in the .bash_aliases file.
A cool thing about BASH is that you can call it through other programming environments too. In Julia, for example, you can either go directly to the shell using the semi-colon shortcut (;), or you can run a BASH command using the format run(`command`). Note that for this to work you need to use the appropriate character to define the command string (`).
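The same idea carries over to other languages used in data science. Here is a minimal sketch in Python, using the standard subprocess module. The helper names run_command and run_bash are made up for this illustration, and the second one assumes that a bash executable is available on your PATH.

```python
import subprocess

def run_command(args):
    """Run a program given as a list of arguments and return its output as text."""
    # check=True raises an error if the command exits with a non-zero status
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

def run_bash(script):
    """Run a one-line BASH script (assumes bash is installed and on the PATH)."""
    return run_command(["bash", "-c", script])

if __name__ == "__main__":
    print(run_command(["echo", "hello"]))
    print(run_bash("echo $((2 + 3))"))
```

Passing the command as a list of arguments, rather than a single string, avoids most quoting headaches, which is much the same reason Julia uses the backtick syntax.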
Interestingly, a lot of the commands in Julia are inspired by the BASH (e.g. pwd() is a direct reference to the pwd BASH command). So, if you are comfortable with Julia, you’ll find the BASH quite accessible too. Not easy, but accessible.
Tools like BASH may not be easy to master (I’m still working on that myself, after all these years) but they can be useful even if you just know the basics. For example, it doesn’t take too much expertise to come up with a useful script that can speed up processes you normally use. For example, the RIP script I created recently can terminate up to three different processes that give you a hard time (yes, even in Linux-based OSes sometimes a process gets stuck!). Also, you can have another script to facilitate backups or even program updates and upgrades. The sky is the limit!
All this may seem irrelevant to data science and A.I. work but the more comfortable you are with low-level programming like BASH, the easier it is to build high-performance pipelines, bringing your other programs to life. After all, you can’t get closer to the metal than with BASH, without using a screwdriver!
Lately, I've been working with Julia a lot, partly because of my new book which was released a couple of weeks ago and partly because it's fun. If you are somewhat experienced with data handling, you're probably familiar with the JLD and JLD2 packages that are built on top of the FileIO one. Of these, JLD2 is faster and generally better but it seems that it's no longer maintained, while the files it produces are quite bulky. This may not be an issue with a smaller dataset but when you are dealing with lots of data, it can be quite time-consuming to load or save data using this package.
Enter the CDF framework, which is short for Compressed Data Format. This is still a work in progress so don’t expect miracles here! However, I made sure it can handle the most common data types, such as scalars of any type (including complex string variables), the most common arrays (including BitArrays) of up to two dimensions, and of course dictionaries containing the above data structures. It does not yet support tensors or any kind of Boolean array (which for some bizarre reason is distinct from the BitArray data structure).
CDF has two types of functions within it: those converting data into a large text string (and vice versa) and those dealing with IO operations, using binary files. The compression algorithm employed is the Deflate one from the CodecZlib package as it seems to have better performance than the other algorithms in the package. Remember that all of these algorithms take a text variable as an input, which is why everything is converted into a string before being compressed and stored in a data file.
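To make the two-layer design more concrete, here is a rough sketch of the same idea in Python. This is not the CDF code itself (which is written in Julia); it just assumes JSON as the text representation and uses the zlib module from the standard library, which wraps a Deflate stream. All function names are made up for this illustration.

```python
import json
import zlib

def to_compressed(data):
    """Conversion layer: data -> text string -> compressed bytes."""
    text = json.dumps(data)
    return zlib.compress(text.encode("utf-8"), 9)  # 9 = strongest compression

def from_compressed(blob):
    """Conversion layer in reverse: compressed bytes -> text string -> data."""
    text = zlib.decompress(blob).decode("utf-8")
    return json.loads(text)

def save(data, path):
    """IO layer: write the compressed bytes to a binary file."""
    with open(path, "wb") as f:
        f.write(to_compressed(data))

def load(path):
    """IO layer: read a binary file and rebuild the original data."""
    with open(path, "rb") as f:
        return from_compressed(f.read())
```

Keeping the conversion functions separate from the IO ones means you can test the round trip in memory before any file gets written.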
The CDF framework doesn't have many of the options JLD2 has but for the files created with it, the performance edge is quite obvious. A typical dataset can be around 20 times smaller and many times faster to read or write, as the compression algorithm is fairly fast and there is less data to process with the computationally heavy IO commands. As a codebase, CDF is significantly simpler than other packages while the main functions are easy as pie. In fact, the whole framework is so intuitive that there is not really any need for documentation for it.
If you are interested in using this framework and/or extending it, feel free to contact me through this blog. Be sure to mention any affiliation(s) of yours too. Cheers!
Although it's fairly easy to compare two continuous variables and assess their similarity, it's not so straightforward when you perform the same task on categorical variables. Of course, things are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think.
For example, if two variables are aligned (zeros to zeros and ones to ones), that’s fine. You can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are reversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity finds them dissimilar though there is no doubt that such a pair of variables may be relevant and the first one could be used to predict the second variable. Enter the Symmetric Jaccard Similarity (SJS), a metric that can alleviate this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of the two Jaccard similarities, one with the features as they are originally and one with one of them reversed.
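For the binary case, the metric can be sketched in a few lines of Python. This is a from-scratch illustration rather than the Julia implementation I use. The function names are made up, and returning 1.0 when neither variable contains any ones is an assumption on my part.

```python
def jaccard(x, y):
    """Jaccard similarity of two equal-length binary (0/1) sequences."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: two all-zero variables are treated as identical
    return both / either if either else 1.0

def symmetric_jaccard(x, y):
    """Maximum of the Jaccard similarity with y as is and with y reversed."""
    flipped = [1 - b for b in y]
    return max(jaccard(x, y), jaccard(x, flipped))

if __name__ == "__main__":
    x = [1, 1, 0, 0]
    y = [0, 0, 1, 1]  # reversely similar to x
    print(jaccard(x, y), symmetric_jaccard(x, y))  # 0.0 1.0
```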
SJS is easy to use and scalable, while its implementation in Julia is quite straightforward. You just need to be comfortable with contingency tables, something that’s already an easy task in this language, though you can also code it from scratch without too much of a challenge. Anyway, SJS is a fairly simple metric, and something I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that’s not as simple as it may first seem.
Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it shuffles the values of the first variable until its similarity with the second variable is maximized, something that’s done in a deterministic and scalable manner. However, it becomes apparent through the algorithm that SJS may fail to reveal the edge that a non-symmetric approach may yield, namely in the case where certain values of the first variable are more similar to a particular value of the second variable. In a practical sense, it means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes.
That's why an exhaustive search of all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is also a good one.
Of course, SJS for nominal features may be useful for assessing if one of them is redundant. Just like we apply some correlation metric for a group of continuous features, we can apply SJS for a group of nominal features, eliminating those that are unnecessary, before we start breaking them down into binary ones, something that can make the dataset explode in size in some cases.
All this is something I was working on the other day, as part of another project. In my latest book “Julia for Machine Learning” (Technics Publications) I talk about such metrics (not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!
The concept of antifragility is well-established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g. in Investopedia). This is a vast concept and it’s unlikely that I can do it justice, especially in a blog post. That’s why I suggest you familiarize yourself with it first before reading the rest of this article.
Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones that manage to get every drop of signal from the data they are given), there are fragilities all over the place when it comes to how these models are used. A clear example of this is the computer code around them. I’m not referring to the code that’s used to implement them, usually coming from some specialized packages. That code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it the one taking care of ETL work, feature engineering, and even data visualization, may not always be good enough.
Antifragility applies to computer code in various ways. Here are the ones I’ve found so far:
All this may seem like a lot of work and it may not agree with your temporal restrictions, particularly if you have strict deadlines. However, you can always improve on your code after you’ve cleared a milestone. This way, you can avoid some Black Swans like an error being thrown while the program you’ve made is already in production. Cheers!
For over two decades, I've played a puzzle game from time to time, usually to pass the time creatively or to challenge myself in algorithm development. This game, which I was taught by a friend, didn't have a name and I never managed to find it elsewhere, so I call it Numgame (as it involves numbers and it's a game). Over the years, I managed to solve many of its levels though I never got an algorithm for it, until now.
The game involves a square grid, originally a 10-by-10 one. The simplest grid that's solvable is the 5-by-5 one. The object of the game is to fill the grid with numbers, starting from 1 and going all the way to n^2, where n is the size of the grid, which can be any number larger than 4 (grids of size 4 or smaller are not solvable).
To fill the grid, you can "move" horizontally, vertically and diagonally, as long as the cell you go to is empty. When moving horizontally or vertically you need to skip 2 squares, while when you move diagonally you need to skip 1. Naturally, as you progress, getting to the remaining empty squares becomes increasingly hard. That's why you need to have a strategy if you are to finish the game successfully.
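To make the movement rules concrete, here is a small Python sketch. It assumes that "skipping 2 squares" means landing three cells away along a row or column, and that "skipping 1" means landing two cells away along a diagonal. The grid is represented as a list of lists with zeros for empty cells, and the names OFFSETS and legal_moves are made up for this illustration.

```python
# Landing offsets as (row, column) steps: skipping 2 squares orthogonally means
# landing 3 cells away; skipping 1 square diagonally means landing 2 cells away.
OFFSETS = [(0, 3), (0, -3), (3, 0), (-3, 0),
           (2, 2), (2, -2), (-2, 2), (-2, -2)]

def legal_moves(grid, row, col):
    """Return the empty cells that can be reached in one move from (row, col)."""
    n = len(grid)
    moves = []
    for dr, dc in OFFSETS:
        r, c = row + dr, col + dc
        # The target must lie inside the grid and be empty (marked as 0)
        if 0 <= r < n and 0 <= c < n and grid[r][c] == 0:
            moves.append((r, c))
    return moves

if __name__ == "__main__":
    grid = [[0] * 5 for _ in range(5)]
    print(sorted(legal_moves(grid, 0, 0)))  # [(0, 3), (2, 2), (3, 0)]
```

Note how few options a corner has on a 5-by-5 grid, which hints at why the remaining empty squares get harder to reach as the game progresses.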
Naturally, not all starting positions yield a successful result. Although more often than not you'd start from a corner, you may choose to start from any other square in the grid. That's useful, considering that some grids are just not solvable if you start from a corner (see image below; empty cells are marked as zeros).
Before we look at the solution I've come across, try to solve a grid on your own and think about a potential algorithm to solve any grid. At the very least, you'll gain an appreciation of the solution afterward.
Anyway, the key to solving the Numgame levels is to use a heuristic that will help you assess each move. In other words, you'll need to figure out a score that discerns between good and bad positions. The latter result from the various moves. So, for each cell in the grid, you can count how many legitimate ways there are for accessing it (i.e. ways complying with the aforementioned rules). You can store these numbers in a matrix. Then, you can filter out the cells that have been occupied already, since we won't be revisiting them anyway. This leaves us with a list of numbers corresponding to the number of ways to reach the remaining empty cells.
Then we can take the harmonic mean of these numbers. I chose the harmonic mean because it is very sensitive to small numbers, something we want to avoid. So, the heuristic will take very low values if even a few cells start becoming inaccessible. Also, if even a single cell becomes completely inaccessible, the heuristic will take the value 0, which is also the worst possible score. Naturally, we aim to maximize this heuristic as we examine the various positions stemming from all the legitimate moves of each position. By repeating this process, we either end up with a full grid or one that doesn't progress because it's unsolvable.
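Here is a rough Python sketch of this heuristic, along with a greedy loop that always picks the move with the highest score. It is an illustration under a few assumptions, not my original code. A move lands 3 cells away orthogonally or 2 cells away diagonally, and a cell counts as a way in if it is either empty or the current position (otherwise the last empty cell would always score zero). There is no backtracking either, so the loop may stop before the grid is full. All names are made up for this example.

```python
# Landing offsets: 3 cells away orthogonally, 2 cells away diagonally.
OFFSETS = [(0, 3), (0, -3), (3, 0), (-3, 0),
           (2, 2), (2, -2), (-2, 2), (-2, -2)]

def neighbors(n, r, c):
    """Cells one move away from (r, c) on an n-by-n grid, ignoring occupancy."""
    return [(r + dr, c + dc) for dr, dc in OFFSETS
            if 0 <= r + dr < n and 0 <= c + dc < n]

def heuristic(grid, current):
    """Harmonic mean of the number of ways to reach each empty cell."""
    n = len(grid)
    counts = []
    for r in range(n):
        for c in range(n):
            if grid[r][c] != 0:
                continue  # occupied cells are filtered out
            ways = sum(1 for (i, j) in neighbors(n, r, c)
                       if grid[i][j] == 0 or (i, j) == current)
            counts.append(ways)
    if not counts:
        return float("inf")  # full grid: nothing left to reach
    if min(counts) == 0:
        return 0.0  # an inaccessible cell gives the worst possible score
    return len(counts) / sum(1.0 / w for w in counts)

def greedy_fill(n, start):
    """Fill the grid by always taking the move that maximizes the heuristic."""
    grid = [[0] * n for _ in range(n)]
    r, c = start
    grid[r][c] = 1
    for k in range(2, n * n + 1):
        best, best_score = None, -1.0
        for (i, j) in neighbors(n, r, c):
            if grid[i][j] != 0:
                continue
            grid[i][j] = k  # try the move
            score = heuristic(grid, (i, j))
            grid[i][j] = 0  # undo it
            if score > best_score:
                best, best_score = (i, j), score
        if best is None:
            break  # stuck: no legal move left
        r, c = best
        grid[r][c] = k
    return grid

if __name__ == "__main__":
    for row in greedy_fill(5, (0, 0)):
        print(row)
```

Any zeros left in the printed grid mean the greedy pass got stuck from that starting position, in which case you can try another starting square.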
This simple problem may seem obvious now, but it is a good example of how a simple heuristic can solve a problem that's otherwise tough (at least for someone who hasn't tackled it enough to figure out a viable strategy). Naturally, we could brute-force the whole thing, but it's doubtful that this approach would be scalable. After all, in the era of A.I. we are better off seeking intelligent solutions to problems, rather than just throwing computing resources at them!
Lately, I've been busy with preparations for my conference trips, hence my online absence. Nevertheless, I found time to write something for you all who keep an open mind to non-hyped data science and A.I related content. So, this time I'd like to share a few thoughts on programming for data science, from a somewhat different perspective.
First of all, it doesn't matter that much what language you use, if you have attained mastery of it. Even sub-Julia languages can be useful if you know how to use them well. However, in cases where you use a less powerful language, you need to know about lambda functions. I mastered this programming technique only recently because in Julia the performance improvement is negligible (unless your original code is inefficient to start with). However, as they make for more compact scripts, it seems like useful know-how to have. Besides, they have numerous uses in data science, particularly when it comes to:
Another thing that I’ve found incredibly useful, and which I mastered in the past few weeks, is the use of auxiliary functions for refactoring complex programs. A large program is bound to be difficult to comprehend and maintain, something that often falls into the workload of someone else you may not have a chance to help out. As comments in your script may also prove insufficient, it’s best to break things down to smaller and more versatile functions that are combined in your wrapper function. This modular approach, which is quite common in functional programming, makes for more useful code, which can be reused elsewhere, with minor modifications. Also, it’s the first step towards building a versatile programming library (package).
Moreover, I’ve rediscovered the value of pen and paper in a programming setting. Particularly when dealing with problems that are difficult to envision fully, this approach is very useful. It may seem rudimentary and not something that a "good data scientist" would do, but if you think about it, most programmers also make use of a whiteboard or some other analog writing equipment when designing a solution. It may seem like an excessive task that may slow you down, but in the long run, it will save you time. I've tried that for testing a new graph algorithm I've developed for figuring out if a given graph has cycles (cliques) in it or not. Since drawing graphs is fairly simple, it was a very useful auxiliary task that made it possible to come up with a working solution to the problem in a matter of minutes.
Finally, I discovered again the usefulness of in-depth pair-coding, particularly for data engineering tasks. Even if one's code is free of errors, there are always things that could use improvement, something that can be introduced through pair-coding. Fortunately, with tools like Zoom, this is easier than ever before as you don't need to be in the same physical room to perform this programming technique. This is something I do with all my data science mentees, once they reach a certain level of programming fluency and according to the feedback I've received, it is what benefits them the most.
Hopefully, all this can help you clarify the role of programming in data science a bit more. After all, you don't need to be a professional coder to make use of a programming language in fields like data science.
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-Solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is a quite reasonable one and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything on this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, opting to find the best way to raise some revenue for the site all while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Short answer: Nope! Longer answer: clustering can be a simple deterministic problem, given that you figure out the optimal centroids to start with. But isn’t the latter the solution of a stochastic process though? Again, nope. You can meander around the feature space like a gambler, hoping to find some points that can yield a good solution, or you can tackle the whole problem scientifically. To do that, however, you have to forget everything you know about clustering and even basic statistics, since the latter are inherently limited and frankly, somewhat irrelevant to proper clustering.
Finding the optimal clusters is a two-fold problem: 1. you need to figure out which solutions make sense for the data (i.e. a good value for K), and 2. you need to figure out these solutions in a methodical and robust manner. The former has been resolved as a problem and it’s fairly trivial. Vincent Granville talked about it in his blog many years ago, and since he is better at explaining things than I am, I’m not going to bother with that part at all. My solution to it is a bit different but it’s still heuristics-based. The 2nd part of the problem is also the more challenging one, since it’s something many people have been pursuing for a while now, without much success (unless you count the super slow method of DBSCAN, with more parameters than letters in its name, as a viable solution).
To find the optimal centroids, you need to take into account two things: the density of each centroid and the distances of each centroid to the other ones. Then you need to combine the two in a single metric, which you need to maximize. Each one of these problems seems fairly trivial, but something that many people don’t realize is that in practice, it’s very hard, especially if you have multi-dimensional data (where conventional distance metrics fail) and lots of it (making the density calculations a major pain). Fortunately, I found a solution to both of these problems using 1. a new kind of distance metric that yields a higher U value (this is the heuristic used to evaluate distance metrics in higher dimensional space), though with an inevitable compromise, and 2. a far more efficient way of calculating densities. The aforementioned compromise is that this metric cannot guarantee that the triangle inequality holds, but then again, this is not something you need for clustering anyway. As long as the clustering algo converges, you are fine.
Preliminary results of this new clustering method show that it’s fairly quick (even though it searches through various values of K to find the optimum one) and computationally light. What’s more, it is designed to be fairly scalable, something that I’ll be experimenting with in the weeks to come. The reason for the scalability is that it doesn’t calculate the density of each data point, but of certain regions of the dataset only. Finding these regions is the hardest part, but you only need to do that once, before you start playing around with K values.
Anyway, I’d love to go into detail about the method but the math I use is different to anything you’ve seen and beyond what is considered canon. Then again, some problems need new math to be solved and perhaps clustering is one of them. Whatever the case, this is just one of the numerous applications of this new framework of data analysis, which I call AAF (alternative analytics framework), a project I’ve been working on for more than 10 years now. More on that in the coming months.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.