What is an API?
In Computer Science, API is short for Application Programming Interface. In essence, it is a facilitator for an organization (e.g. a company) to share information with its clients and partners over the internet, often bypassing websites altogether. An API is designed for computer programs, so it's usually developers who deal with this tech, though many data scientists and business people are getting involved in this promising piece of technology.
Why are APIs important?
APIs make prototyping a service super-fast, and they enable easier and more scalable use of data. That data can come from all sorts of sources and systems, since APIs are platform-agnostic. So, if you want to create a mobile app that employs geo-location data along with various security processes (e.g. for user authentication), you can do this easily using APIs. Also, if you already have a website for handling this sort of information exchange, you can use an API to let your target audience interact with your online system without even visiting your site (the API becomes a proxy for your site's back-end, enabling them to access it through the app). For these and other reasons, APIs are very important today and an essential part of any data-driven organization.
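To make this a bit more concrete, here is a minimal sketch (in Julia, using the HTTP and JSON3 packages) of what querying such an API could look like from a client program; the endpoint URL and the response fields are hypothetical, purely for illustration.

    using HTTP, JSON3

    # Hypothetical geo-location endpoint; the URL and fields below are made up
    resp = HTTP.get("https://api.example.com/v1/geolocate?ip=203.0.113.7")

    if resp.status == 200
        data = JSON3.read(String(resp.body))      # parse the JSON payload
        println("Location: ", data.lat, ", ", data.lon)
    end

The client never touches the website or the database directly; it just exchanges structured data with the API, which is exactly the proxy role described above.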
Thoughts on the "API Success" book
So, what about the "API Success" book by Nelson Petracek (Technics Publications)? Well, this book covers the topic from various angles, with a strong focus on the business side of it. It provides lots of examples justifying the value-add of APIs and showing where they fit within a modern organization. The book is well-written and easy to read, despite the large number of acronyms used in it. Interestingly, the book covers marketing as well, making a strong case for using APIs in a business project, be it as the main product or part of a package. It even explores how APIs can facilitate partnerships with other organizations and foster long-term business relationships. The author, who is a very hands-on person, has a good sense of humor and writes in a way that's engaging and easy to follow.
The strongest part of the book, in my view, is the collection of architectural and design-related tips and the advice on the life cycle of an API, along with the corresponding diagrams that make all this quite comprehensible. As for shortcomings, the lack of any hands-on material or reference resources is the only one that stands out. Nevertheless, the rest of the book makes up for this through its comprehensive coverage of the topic from various angles.
How you can get this book at a 20% discount
Although this book is available in a variety of places, you can get it at a discounted price if you go to the publisher’s site and use the coupon code DSML at the checkout. The book is already reasonably priced (around $30 for the printed version) but why not get it at a lower price? After all, this is a book with evergreen content, something you’d like to refer to again and again, maybe even share with your team when building your own APIs. Check it out!
In a previous article we talked about the value of data modeling and how it is related to data science as a field. Now let’s look at some great ways to learn more about this field.
Specifically, Technics Publications offers a few classes/workshops on data modeling this Autumn:
What’s more, you can get a 20% discount on them, if you use the coupon code DSML. You can use the same code for most of the books available on that site. Check it out!
JuliaCon stands for Julia Conference, and it's an annual educational and promotional event organized by Julia Computing, the Massachusetts-based company that manages the development and evolution of the programming language. So, JuliaCon is its way of promoting the language and keeping everyone interested in it up to date on its recent developments.
JuliaCon is primarily for programmers and members of the scientific community who employ Julia in their work. However, it also appeals to Julia enthusiasts and anyone interested in the language's ecosystem, as well as its numerous applications. It's not targeted at data scientists per se, though lately there have been several sessions involving Machine Learning and A.I., since lots of people are interested in these areas. Note that most of the people working on these ML and A.I. packages are not professional data scientists, though some of them are familiar with the field and have written (mostly academic) papers about it. So, if you are looking to learn about data science at this conference you may be disappointed, yet if you just wish to explore what Julia brings to the table when it comes to data science tools, you may be in for a treat.
Several interesting things were presented at this year's JuliaCon, which I attended. Namely, the Tuesday workshop on improving Julia code performance and interoperability with other programming languages was truly worth it, as it covered a variety of tweaks that can make a script use less memory and/or run faster. Also, incorporating Python and C code into a Julia script was covered thoroughly, more so than any documentation page manages.
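To give you an idea of what that interoperability looks like in practice, here is a minimal sketch; it assumes the PyCall package and a working Python installation for the Python part, while the C part calls a standard libc function on a POSIX system.

    using PyCall

    # Python interop: import a Python module and call it as if it were Julia code
    pymath = pyimport("math")
    println(pymath.sqrt(2))            # 1.4142135623730951

    # C interop: call a function from the C standard library directly, no glue code
    pid = ccall(:getpid, Cint, ())
    println("Process id: ", pid)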
Unfortunately, some sessions weren't properly synced and were either delayed or missing altogether from the live stream (at least on my Firefox browser). This definitely took away from the overall conference experience. Perhaps if the whole conference had been done on Zoom, things would have gone more smoothly. The Q&A chat in the workshops was a nice touch, though, and added a lot to them.
The sessions themselves were pretty good overall, covering a variety of topics, from the more technical to the more application-oriented. They were organized into different tracks, making it easy to find the session you were most interested in. The "Interactive data dashboards with Julia and Stipple" session stood out. Even though it was fairly short, it was very relevant to data science work and came with good examples showcasing the package's functionality. I'd definitely recommend you watch the recording of it, which should be available by now on the Julia YouTube channel, along with the other sessions of the conference.
JuliaCon usually takes place in either the US or Europe. This year it was Europe's turn to host the conference, and it was scheduled to take place in Lisbon, Portugal. Although that laid-back Atlantic-coast capital would be ideal for such a conference (definitely more accessible than London, where it took place a couple of years ago), this year, for the first time, the event took place online. This was due to the safety measures related to Covid-19, which impacted logistics severely. Anyway, if all goes well, next year it is expected to take place in the US. If you wish to delve deeper into Julia, feel free to check out my books on the subject. Cheers!
It may seem that we are getting off-track here, but this topic is highly relevant to any data scientist, particularly those on the data engineering path. Yet, as data modeling is an overloaded term, let's first clarify what we mean by it as a field.
In a nutshell, data modeling is the field that deals with the design and implementation of databases, as well as the organization of data flows in an environment. It entails a combination of design elements, such as UML diagrams, and some analytical aspects, such as code for creating and querying databases based on specialized diagrams called database schemas (the image above is one such schema, though in practice they tend to be more detailed). Data modeling professionals also deal with the cloud, since many databases these days live there. Also, some data modeling experts work directly with the business and help a project's stakeholders optimize the flow of information across the various departments of their organization, or build pipelines to better process the data at hand.
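To give a flavor of that "code" side of the field, here is a minimal sketch in Julia of turning a tiny schema into an actual database; it assumes the SQLite and DBInterface packages, and the table itself is just an illustrative example.

    using SQLite, DBInterface

    db = SQLite.DB()   # an in-memory database, for illustration

    # A tiny schema: one table, as it might appear in a database schema diagram
    DBInterface.execute(db, """
        CREATE TABLE customers (
            id      INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            country TEXT
        )
    """)

    DBInterface.execute(db,
        "INSERT INTO customers (name, country) VALUES ('Acme Corp', 'US')")

    # Query the table back, row by row
    for row in DBInterface.execute(db, "SELECT id, name FROM customers")
        println(row.id, ": ", row.name)
    end

In a real project the schema would of course be far richer, but the principle is the same: the diagram becomes code, and the code becomes a working database.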
Data modelers come in different shapes and forms. From the more business-oriented ones to the more hands-on ones (e.g. DBAs), they cover a wide spectrum of roles. This is akin to data scientists, who are also quite specialized these days. However, data modelers have been around longer, so their roles are more established and more widely acknowledged in the business world. After all, databases have been around since the early days of computing, even if only recently have they evolved enough to become an important component of modern technologies such as big data governance and cloud computing. Also, note that most data modelers these days are involved in NoSQL, even if they are proficient in SQL-based languages. The reason is that most data today is semi-structured, something that NoSQL databases are designed for. Of course, structured data still exists, but usually there isn't as much of it, nor is it as easy to produce.
Hopefully by now the link between the data modeling field and the data science one has started to become clear. After all, they are both data-oriented fields. The common link is databases, since that's the core product of data modelers and the starting point of most data science projects. Without databases, we don't have much to work with, so it's not uncommon to collaborate with data modelers, particularly in the initial stages of a data science project. Also, data modelers have an interest in analytics, so it's not uncommon for them to dabble with predictive models, e.g. in a proof-of-concept project. What's more, data modeling conferences can be a valuable educational resource for data scientists, as they enable us to view parts of an organization that aren't always evident in a data science conference, where the focus is generally more technical.
Data modeling is particularly relevant, if not essential, to data engineers, those data scientists who specialize in the initial stages of a data science project. This involves a lot of ETL work, as well as querying and augmenting databases. So, data engineers need to have a more concrete understanding of data modeling, even if it is focused on the more hands-on side of the field. After all, anyone can do some basic querying or table creation, but building an efficient and scalable database takes much more than that.
Fortunately, nowadays it's easier than ever to learn more about data modeling. Also, you can do that without spending too much time, since the material on the field is abundant and well organized. The fact that it's not a "sexy profession" like that of the data scientist makes it less prone to hype and to halfwits taking advantage of it through low-quality material. What's more, some publishers specialize in data modeling, such as Technics Publications. Finally, using the promo code DSML you can get a 20% discount on all the books and any webinars the publishing house offers.
Throughout our careers in data science and AI, we constantly encounter all sorts of obstacles that hinder our development. This is inevitable, particularly when we undertake a role that's continuously evolving. However, the biggest obstacle is not something external, as one might think, but something closer to home. On the bright side, this means that it's more within our control than anything subject to external circumstances. Let's clarify.
The biggest obstacle is related to the limits of our aptitude, something primarily linked to our knowledge and know-how. After all, no one knows all there is to know on a subject as broad as data science (or AI). However, once we gather enough knowledge to do what we are asked to, we can be overtaken by the idea that we know enough. Eventually, this can morph into a conviction and even expand, letting us cultivate the illusion that we know everything there is to know in our field. Naturally, nothing could be further from the truth, since even a unicorn data scientist has gaps in her knowledge.
One great way to overcome this obstacle is to constantly challenge yourself in anything related to our field. I'm not talking about Kaggle competitions and other trivial things like that; after all, these are hardly realistic as data science challenges. I'm referring to challenging yourself on techniques and methods you are lacking, as well as refining those you already have under your belt. This may seem simple, but it's not, especially since no one enjoys becoming aware of the things he doesn't know or doesn't know fully. Perhaps that's why developing ourselves isn't something easy or popular.
Another way to enhance ourselves is by reading technical books related to our field. Of course, not all such books are worth your while, but if you know where to look, finding the good ones isn't as challenging a task. What's more, it's good to remember that the value of such a book also depends on how you process the new information. For example, many such books contain exercises and problems that the reader is asked to solve. By taking advantage of such opportunities, you can learn the new material better and develop a deeper understanding of the topics presented.
One way to learn more is through Technics Publications books. Although many of the books from that publishing house are related to data modeling, there are a few data science-related ones, as well as a couple on AI. Of course, even the data modeling books can be useful to a data scientist, since we often need to deal with databases, particularly in the initial stages of a project. Also, if you buy a book from this publisher using the coupon code DSML, you can get a 20% discount. The same applies to any webinars you may register for. So, if the cost of this material is an obstacle for you, at least with this code you can alleviate it and get a bigger bang for your buck!
Normally I don't do book reviews on this blog, but for this one I thought I'd make an exception. After all, it's not every day that I encounter a book that tackles topics like Logic head-on, without getting all abstract and theoretical. This book not only manages to remain practical but also gives a good overview of the topic of logic, something that every data professional can benefit from. Note that this book is on the subject of data modeling, which, although related to data science, is its own field, concerned with databases and the design of such systems.
First of all, the book provides an excellent introduction to Logic, without getting too mathy about the topic. When I was looking into Ph.D. topics, I briefly considered doing my research on this subject. However, I quickly dismissed it because it was too abstract and theoretical. This book addresses that point and presents the subject in a very practical way, making it relatable and interesting. It manages this by drawing a connection between Logic and databases, with plenty of examples. This enables the reader to maintain a practical viewpoint across the different topics covered in the book and to view logic as a useful tool.
What's more, the author does a pretty good review of other books on the subject, with robust criticism of their strengths and weaknesses. In a way, it feels like reading a bunch of books and getting the gist of their approaches without having to go through their text. It is evident that the author knows the subject in great depth, something he exhibits through his approach to the subject, which is also quite distinct. For example, he provides a great analysis of topics that weren't covered properly elsewhere, such as integrity.
Also, the author provides lots of references at the end of each chapter, making the whole book feel a bit academic in that sense, but without the rigid style that characterizes such books. For someone who wishes to explore the various topics further, these lists of relevant resources can be quite handy.
Moreover, the book is fairly easy to understand, even for non-experts in data modeling or logic. This is important, since it's not common to find a technical book that's accessible to people outside the topic. This book, however, seems to have a very broad audience, including people who know very little about the subject.
Finally, there are lots of definitions of key concepts and a scientific approach to the subject overall. This is also not very common, since not all technical books are written by scientists. Many people nowadays write books based on their experience and empirical knowledge of a subject. This book, however, was written in a scientific manner, even if it doesn't have the typical academic style.
So, if you are interested in buying this book, you can do so directly from the publisher. Also, if you use the coupon code DSML, you can get a 20% discount, making this purchase a bargain. Note that this code applies to other books available at the Technics Publications site, including some of the webinars.
Contrary to what the image suggests, this article is about the Bourne-again shell, aka BASH. BASH is essential for any serious work on the GNU/Linux OS, which, believe it or not, is by far the most popular OS in the world, considering that the vast majority of servers run on it. Also, Linux-based OSes are popular among data scientists due to the versatility they offer, as well as their reliable performance.
BASH is the way you interact with the computer directly when using a GNU/Linux system. In a way, it is its own programming language, one that operates much closer to the operating system than the languages we normally use, in the spirit of low-level languages like C. BASH is ideal for ETL processes, as well as any kind of automation. For example, the cron jobs you may set up using crontab are essentially shell commands scheduled to run automatically.
Of course, nothing beats a good graphical user interface (GUI) when it comes to ease of use, which is why most Linux-based OSes make use of a GUI too (e.g. KDE, Gnome, or Xfce). However, it's the command-line interface (CLI) that's the more powerful of the two. Also, the CLI is more or less the same across all Linux-based OSes and makes use of commands that are almost identical to those of UNIX systems. The programmatic framework behind all this is, naturally, BASH.
BASH is not for the fainthearted, however. Unlike most programming languages used in data science (e.g. Python, Julia, Scala, etc.), BASH is not as easy to work with, as it has an old-school kind of style. However, if you are proficient in any programming language, BASH doesn't seem so daunting. Also, scripts written in it are super fast and can carry out a lot of useful tasks promptly. Yet, even without getting your hands dirty with scripting in BASH, you can do a lot of things using aliases, which are defined in the .bash_aliases file.
A cool thing about BASH is that you can call it from other programming environments too. In Julia, for example, you can either drop directly into the shell using the semi-colon shortcut (;), or you can run a BASH command using the format run(`command`). Note that for this to work you need to use the backtick character (`) to define the command string.
Interestingly, a lot of the commands in Julia are inspired by BASH (e.g. pwd() is a direct reference to the pwd BASH command). So, if you are comfortable with Julia, you'll find BASH quite accessible too. Not easy, but accessible.
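Here is a small sketch of what this looks like in practice (the directory listing is just an example command):

    # Run a BASH command from within Julia
    run(`ls -l`)                  # executes the command and prints its output

    # Capture a command's output as a String instead
    files = readchomp(`ls`)

    # Julia's own pwd(), mirroring the pwd command in BASH
    println(pwd())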
Tools like BASH may not be easy to master (I'm still working on that myself, after all these years), but they can be useful even if you just know the basics. It doesn't take much expertise to come up with a useful script that can speed up processes you use regularly. For example, the RIP script I created recently can terminate up to three different processes that give you a hard time (yes, even in Linux-based OSes a process sometimes gets stuck!). You can also have another script to facilitate backups, or even program updates and upgrades. The sky is the limit!
All this may seem irrelevant to data science and A.I. work but the more comfortable you are with low-level programming like BASH, the easier it is to build high-performance pipelines, bringing your other programs to life. After all, you can’t get closer to the metal than with BASH, without using a screwdriver!
Suppose we have an advanced AI system (e.g. an AGI) which has been proven to be safe enough to use in general-purpose scenarios. This system can do all sorts of things, including finding the optimal stance on a matter, given the data available on the subject. Fortunately, everyone who manages the corresponding databases has agreed to let this AI access and analyze the data, so long as it acknowledges the generous contributors of that data. So, data abundance is a given in this hypothetical scenario. Now, with this data and the immense computing resources at its disposal, the AI can sort out any controversial topic and come up with a mathematically sound solution that is valid beyond any doubt, given the data at hand. The question is: would you trust this result, even if it is probably beyond your understanding, and accept it as the "right answer" to the controversial topic in question?
Let's make this more concrete. Suppose that we are dealing with a fairly realistic situation where we have a settlement in some inhospitable environment (e.g. a research center in Antarctica or on the ISS). Due to unforeseeable circumstances, there aren't sufficient resources to save everyone and everything from that place. So, someone has to decide: should they save all the scientific samples that these people have spent years accumulating and/or analyzing, or the scientists themselves? Or perhaps a combination of the two, prioritizing senior scientists, for example? Obviously, this isn't a decision anyone would be comfortable making, especially someone with a conscience. However, an AI system may be more than happy to provide a solution to this problem. A clear-cut solution may be unfathomable to us, but for that AI (which has access to all sorts of data, not just the data specific to the problem at hand), it's a much more feasible task. Yet, we may not like what the AI's solution turns out to be. Would we accept it nevertheless? And to whom would we attribute responsibility in this scenario?
Thinking about things like this may not help anyone gain a better understanding of the ins and outs of AI technology. However, one could argue that solving this sort of conundrum is as important as sorting out the technical aspects of AI. After all, at some point, probably sooner rather than later, we may have to deal with real-world situations akin to this thought experiment. So, preparing ourselves for this is definitely a worthwhile task, even if it seems challenging or futile, depending on who you ask. There is no doubt that AIs help us solve all sorts of problems, and we can outsource a large variety of tasks to them. Soon, an AI may be able to undertake even high-level responsibilities. It is doubtful, however, that it can act ethically if we are not able to do the same ourselves. And we don't need an AI to know that with sufficient certainty. Cheers!
Being an author has many benefits, some of which I've mentioned in a previous article. After all, an author (particularly a technical author) is more than just a writer. The former has undergone the scrutiny of the editing process, usually undertaken by professionals, while a writer may or may not have done the same. Also, an author has seen a writing project through to completion and has gotten a publisher to put its stamp of approval on that manuscript before making it available to a larger audience. This raises the stakes significantly and adds a great deal of gravity to the book at hand.
Being an author is its own reward (even though there are other, tangible rewards to it too, such as the royalty checks every few months!). However, there is a benefit that is much less obvious, although it is particularly useful. Namely, an author can appreciate other authors more and learn from them. This is something I started to learn with my first book, and that appreciation has reached new heights since then. This is especially the case when it comes to veteran authors who have written more than one book.
All this leads to an urge to read more books and get more out of them. This is due to the value an author sees in these books. Instead of just a collection of words and ideas, he views a book as a sophisticated structure comprising many layers. Even simple things like graphics take on a new meaning. Of course, much of this detailed view of a book is a bit technical, but the appreciation that this extra attention fosters is something that lingers long after the book is read.
Nevertheless, you don't need to be an author to have the same appreciation for other people's books. This is something that grows the more you practice it and can evolve into a sense of discernment, distinguishing books worth having on your bookshelf from those you are better off leaving at the store! At the very least, this ability can help you save time and money, since it helps you focus on the books that have the most to offer you.
In my experience, Technics Publications has the kind of books worth keeping close to you, particularly if you are interested in data-related topics. This includes data science, but also other disciplines like data modeling, data governance, etc. There is even a book on Blockchain, which I found very educational when I was looking into this technology and its uses beyond cryptocurrencies. Anyway, since good books come at a higher cost, you may want to take advantage of a special promo the publisher is running, which gives you a 20% discount on all books except the DMBOK ones. To get this discount, just use the DSML coupon code at the checkout (see image below).
Note that this coupon code also applies to the virtual classes offered by Technics Publications (i.e. the virtual training courses in the ASK series). This, however, is a topic for another article. Cheers!
Lately, I've been working with Julia a lot, partly because of my new book, which was released a couple of weeks ago, and partly because it's fun. If you are somewhat experienced with data handling, you are probably familiar with the JLD and JLD2 packages, which work with the FileIO one. Of these, JLD2 is faster and generally better, but it seems that it's no longer maintained, while the files it produces are quite bulky. This may not be an issue with a smaller dataset, but when you are dealing with lots of data, it can be quite time-consuming to load or save data using this package.
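For reference, a typical JLD2 workflow looks something like the following sketch (assuming the JLD2 package is installed); the variable names are arbitrary.

    using JLD2

    x = rand(1000)
    s = "some metadata"

    @save "mydata.jld2" x s    # write the variables to a JLD2 file
    @load "mydata.jld2" x s    # load them back into the workspace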
Enter the CDF framework, which is short for Compressed Data Format. This is still a work in progress, so don't expect miracles here! However, I made sure it can handle the most common data types, such as scalars of any type (including complex and string variables), the most common arrays (including BitArrays) of up to two dimensions, and of course dictionaries containing the above data structures. It does not yet support tensors or any kind of Boolean array (which, for some bizarre reason, is distinct from the BitArray data structure).
CDF has two types of functions within it: those converting data into a large text string (and vice versa), and those dealing with IO operations using binary files. The compression algorithm employed is Deflate, from the CodecZlib package, as it seems to perform better than the other algorithms in that package. Remember that all of these algorithms take a text variable as input, which is why everything is converted into a string before being compressed and stored in a data file.
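As a rough illustration of that conversion-and-compression step, here is a minimal sketch using CodecZlib's Deflate codec (this is not the CDF code itself, just the general idea behind it):

    using CodecZlib

    text  = "some serialized data, repeated many times... " ^ 100
    bytes = Vector{UInt8}(text)                        # the string becomes bytes

    compressed = transcode(DeflateCompressor, bytes)   # Deflate compression
    restored   = String(transcode(DeflateDecompressor, compressed))

    println(length(bytes), " bytes -> ", length(compressed), " bytes")
    @assert restored == text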
The CDF framework doesn't have many of the options JLD2 has, but for the files created with it, the performance edge is quite obvious. A typical dataset can be around 20 times smaller and many times faster to read or write, as the compression algorithm is fairly fast and there is less data to process with the computationally heavy IO commands. As a codebase, CDF is significantly simpler than other packages, while the main functions are easy as pie. In fact, the whole framework is so intuitive that there isn't really any need for documentation for it.
If you are interested in using this framework and/or extending it, feel free to contact me through this blog. Be sure to mention any affiliation(s) of yours too. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.