Strategies for Programming
Programming, particularly in languages like Julia, Python, and Scala, is fundamental in data science. It enables all kinds of processes, such as data engineering (particularly ETL tasks), data modeling, and even data visualization, to name a few. If you know what you are doing, you can also solve practical problems through programming (e.g., optimization tasks) by modeling them appropriately. It's a versatile tool with a lot of potential, especially once you get used to it and see it as an extension of your mind. However, it's not as simple as putting bits and pieces of programming code together. This article will examine the various strategies for coding concerning the various objectives we may have.
Let's start with the most intuitive kind of objective, namely getting things done as quickly as possible. This strategy may be suitable for solving problems that need a solution once, so the code doesn't need to be revised or reused again. This sort of strategy involves writing code that works to prove a particular concept before solving the task more seriously. Efficiency isn't pursued, nor is readability and the use of lots of comments, to explain what's happening. This strategy is typical for solving a drill or a relatively simple problem that you don't need to present to the project stakeholders. As a result, using this strategy for other scenarios is a terrible idea.
Another objective we may have when writing a script is efficiency. When we process lots of data, we don't want to have lazy code that takes a while to finish the task at hand. So, optimizing the code for efficiency through smart memory allocations, static typing, using the appropriate variable types, etc. can help with that. This programming strategy is quite a common one that can save us a lot of time. However, it's also useful when deploying this code at scale, since it means that we'll be using fewer computational resources (CPU/GPU power and memory), lowering the cost of the project at hand.
Interpretability and maintainability are a different objective altogether, tied to the final programming strategy. So, if you want your program to be easy to read and understand, making it easy to update when necessary, you opt for this strategy. It involves organizing your code to break the problem down into simple tasks handled in different classes and functions, including lots of comments explaining your reasoning and what different functions do. Naming the variables in an intuitive way is also a big plus, even if that makes the code longer at times. In any case, such code is built to last since it's easy to maintain and helps newcomers that view it adopt good practices when writing their own code.
Naturally, you can use a combination of the above strategies for your project. Not all of them play ball together, of course, but you can still make a script that's efficient and easy to understand/maintain. So, unless pressed with time, it's good to have such an approach to your programming, adapting it to each project's requirements.
If you wish to learn more about programming and how it applies to data science, you can check out one of my latest books, Julia for Machine Learning. This book explores how the Julia programming language can be used to tackle various data science problems, using machine learning models and heuristics. Accompanied by a series of examples in Jupyter notebooks and script files, it illustrates in quite comprehensible code how you can implement this framework for your data science work. So, check it out when you have a moment. Cheers!
Text editors are specialized programs that enable you to process text files. Although they are relatively generic, many of these text editors focus on script files, such as those used to store Julia and Python code (pretty much every programming language makes use of text files to store the source code of its scripts). So, modern text editors have evolved enough to pinpoint particular keywords in the script files and highlight them accordingly. This highlighting enables the user to understand the script better and develop a script more efficiently. Many text editors today can help pinpoint potential bugs (programming jargon for mistakes or errors), making the whole process of refining a script easier.
In data science work and work related to A.I., text editors are immensely important. They help organize the code, develop it faster and easier, optimize it, and review it. Data science scripts can often get bulky and are often interconnected, meaning that you have to keep track of several script files. Text editors make that more feasible and manageable as a task, while some can provide useful shortcuts to accelerate your workflow. Additionally, some text editors integrate with the programming languages themselves, enabling you to run the code as you develop it while keeping track of your workspace and other useful information. This is what people call an IDE, short for Integrated Development Environment, something essential for any non-trivial data science project.
One of the text editors that shine when it comes to data science work is Atom. This fairly minimalist text editor can handle various programming languages, while it also exhibits several plugins that extend its functionality. It's no wonder that it is so widely used by code-oriented professionals, including data scientists. Like most text editors out there, it's cross-platform and intuitive, while it's highly customizable and easy to learn. It's also useful for viewing text files, though you may want to look into more specialized software for huge files.
Another text editor that gained popularity recently, particularly among Julia users, is Visual Studio Code (VS Code). This text editor is much like Atom but a bit easier and slicker in its use. It has a smoother interface, while its integration of the terminal is seamless. The debug console it features is also a big plus, along with the other options it provides for trouble-shooting your scripts. Lately, it has become the go-to editor for Julia programmers, something interesting considering how vested the Julia community had been to the Atom editor and its Julia-centric version, Juno.
Beyond these two text editors, there are other ones you may want to consider. Sublime Text, for example, is noteworthy, though its full version carries a price tag. In any case, the field of text editors is quite dynamic, so it's good to be on the lookout for newer or newly-revised such software that can facilitate your scripting work.
If you want to practice coding for data science and A.I. projects, there are a few books I’ve worked on that I’d recommend. Namely, the two Julia books I've written, as well as the A.I. for Data Science book I've co-authored, are great resources for data science and A.I. related coding. Check them out when you have a moment. Cheers!
As you may already know, Julia is a functional programming language geared towards scientific computing. It is particularly useful in data science nowadays as there are many specialized libraries for this. Simultaneously, it's a fast and easy-to-work-with language, enabling you to create useful scripts for data science tasks quickly. Additionally, it's similar to Python while there are bridge packages for the two languages, making it possible to jump from one to another, leveraging code from both languages in your data science projects.
As for machine learning, this is the part of data science that deals with the creation, refining, and deployment of specialized data models, based on the data-driven approach to data analytics. It involves systems like K-means (for clustering), Support Vector Machines (for predictive analytics), and various heuristics (for specific tasks such as feature evaluation) to facilitate all kinds of data science work. Much like Statistics, it is versatile, though contrary to Stats, it doesn't rely on probabilistic reasoning and distributions for analyzing the data. That's not to say that you need to pick between the two frameworks, however. A good data scientist uses both for her work.
Julia and machine learning are a match made in heaven. Not only does Julia offer direct support for machine learning tasks (e.g., through its various packages), but it also makes it easy for a data scientist (having just basic training in the language) to write high-performance scripts for processing the data at hand. You can even use Julia just for your data engineering tasks if you are already vested in another programming language for your data models. So, it's not an "either-or" kind of choice, but more of an add-on situation. Julia can be the add-on, though once you get familiar with it, you may want to translate your whole codebase in this language for the extra performance it offers.
This prediction isn't some optimistic speculation, by the way. Julia has been evolving for the past few years at a growing rate, even though other programming languages have also been coming about. Furthermore, it has the backing of a prestigious university (MIT), while there is a worldwide community of users and Julia-specific events, such as JuliaCon, happening regularly. So, if this trend continuous, Julia is bound to remain relevant for the years to come, expanding in functionality and application areas. Naturally, if machine learning continues on its current trajectory, it will also stick around for the foreseeable future.
If you want to learn more about Julia and machine learning, especially from a practical perspective, please check out my book Julia for Machine Learning, published in the Spring of 2020. There, you can learn more about the language, explore how it's useful in machine learning, learn more about what machine learning entails and how it ties in the data science pipeline, and experiment with various heuristics not so well known (some of them are entirely original and come with the corresponding Julia code). So, check this book out when you have the chance. Cheers!
The Ethics of Web Scraping
In a nutshell, web scraping is the process of taking stuff from the web programmatically and often at scale. This involves specialized libraries as well as some understanding of how websites are structured, to separate the useful data from any markup and other stuff found on a web page.
Web scraping is very useful, especially if you are looking at updating your dataset based on a particular online source that's publicly available but doesn't have an API. The latter is something very important and popular these days, but it’s beyond the scope of this article. Feel free to check out another article I’ve written on this topic. In any case, web scraping is very popular too as it greatly facilitates data acquisition be it for building a dataset from scratch or supplementing existing datasets.
Despite its immense usefulness, web scraping is not without its limitations. These generally fall under the umbrella of ethics since they aren’t set in stone nor are they linked to legal issues. Nevertheless, they are quite serious and could potentially jeopardize a data science project if left unaddressed. So, even though these ethical rules/guidelines vary from case to case, here is a summary of the most important ones that are relevant for a typical data science project:
It’s also important to keep certain other things in mind when it comes to web scraping. Namely, since web scraping scripts tend to hang (usually due to the server they are requesting data from), it's good to save the scraped data periodically. Also, make sure you scrape all the fields (variables) you need, even if you are not sure about them. It's better to err on the plus side since you can always remove unnecessary data afterward. If you need additional fields though, or your script hangs before you save the data, you'll have to redo the web scraping from the beginning.
If you want to learn more about ethics and other non-technical aspects of data science work, I invite you to check out a book I co-authored earlier this year. Namely, the Data Scientist Bedside Manner book covers a variety of such topics, including some hands-on advice to boost your career in this field, all while maintaining an ethical mindset. Cheers!
Docker, Micro-services, and Cloud Computing in General, for Data Science Projects
Cloud computing has taken the world by storm lately, as it effectively democratized computing power and storage. This has inevitably impacted data science work too, especially when it comes to the deployment stage of the pipeline. Two of the most popular methods for accomplishing this is through containers and micro-services, both enabling your programs to run on some server somewhere with minimal overhead.
The value of this technology is immense. Apart from the cost-saving that derives from the overhead reduction, it makes the whole process easier and faster. Getting acquainted with a container program like Docker isn't more challenging than any other software a data scientist uses, while there are lots of docker images available for the vast majority of applications (including a variety of open-source OSes). Cloud computing in general is quite accessible, especially if you consider the options companies like Amazon offer.
The key value-add of all this is that a data scientist can now deploy a model or system as an application or an API on the cloud, where it can live as long as necessary, without burdening your coworkers or the company’s servers. Also, through this deployment method, you can scale up your program as needed, without having to worry about computational bandwidth and other such limitations.
Cloud computing can also be quite useful when it comes to storage. Many databases nowadays are available on the cloud, since it's much easier to store and maintain data there, while most cloud storage places have quite a decent level of cybersecurity. Also, having the data live on the cloud makes it easier to share it with the aforementioned data products deployed as Docker images, for example. In any case, such solutions are more appealing for companies today since not many of them can afford to have their own data center or any other in-house solutions.
Of course, all this is the situation today. How are things going to fare in the years to come? Given that data science projects may span for a long time (particularly if they are successful), it makes sense to think about this thoroughly before investing in it. Considering that more and more people are working remotely these days (either from home due to COVID-19, or from a remote location because of a lifestyle choice), it makes sense that cloud computing is bound to remain popular. Also, as most cloud-based solutions become available (e.g. Kubernetes), this trend is bound to continue and even expand in the foreseeable future.
Hopefully, it has become clear from all this that there are several angles to a data science project, beyond data wrangling and modeling. Unfortunately, not many people in our field try to explain this aspect of our work in a manner that's comprehensible and thorough enough. Fortunately, a fellow data scientist and I have attempted to cover this gap through one of our books: Data Scientist Bedside Manner. In it, we talk about all these topics and outline how a data science project can come into fruition. Feel free to check it out. Cheers!
SQL and Its Variants
Structured Query Language, or SQL for short, is a powerful database language geared toward structure data. As its name suggests, SQL is adept at querying databases to acquire the data you need, in a useful format. However, it includes commands that involve the creation or alteration of databases so that they can fit the requirements of your data architectural model. SQL is essential for both data scientists and other data professionals. Let's look into it more, along with its various variants.
Although the tasks performed by SQL (when it comes to data wrangling), you can perform with other programming languages too, its efficiency and relative ease of use make it a great tool. Perhaps that's why it's so popular, with many variants of it. The most well-known one, MySQL, specializes in web databases, though it can also be used for other applications. Other variants, such as PostgreSQL, are mostly geared towards the industry. All this may seem somewhat overwhelming, considering that each variant has its peculiarities. However, all of these SQL variations are similar in their structure, and it doesn't take long to get accustomed to each one of them if you already know another SQL variant.
Yet, all of the SQL databases tend to be limited in the structure of the data. In other words, if your data is semi-structured (i.e., there are elements of structure in it but it's not tabular data), you need a different kind of database. Namely, you require a NoSQL (i.e., Not Only SQL) one. Databases like MongoDB, MariaDB, etc. are of this category. Note that NoSQL databases have many commands in common with SQL, but they are geared toward a different organization of column-based data. This characteristic enables them to be faster and able to handle dictionary-like structures.
Naturally, there are plenty more kinds of SQL variants, primarily under the NoSQL paradigm. However, beyond these SQL-like databases, there are also those related to graphs, such as GraphQL. These specialized systems are geared towards storing and querying data in graph format, which is increasingly common nowadays. All these database-related matters are under the umbrella of data modeling, a field geared towards organizing data flows, optimizing the ways this data is stored, and ensuring that all the people involved (the consumers of this data) are on the same page. Although this is not strictly related to the data science field, it's imperative, plus knowing about it, enables better communication between the data architects (aka data modelers) and us.
You can learn more about SQL and other data modeling topics through the Technics Publications website. There, you can use the coupon code DSML for a 20% discount on all the titles you purchase. Note that this code may not apply to all the video material you'll find there, such as the courses offered. However, you can use it to get a discounted price for all the books. I hope you find this useful. Cheers!
Benchmarking Data Science Scripts
Benchmarking is the process of measuring a script's performance in terms of the time it takes to run and memory it requires. It is an essential part of programming and it's particularly useful for developing scalable code. Usually, it involves a more detailed analysis of the code, such as profiling, so we know exactly which parts of the script are run more often and what proportion of the overall time they take. As a result, we know how we can optimize the script using this information, making it lighter and more useful.
Benchmarking is great as it allows us to optimize our scripts but what does this mean for us as data scientists? From a practical perspective, it enables us to work with larger data samples and save time. This extra time we can use for more high-level thinking, refining our work. Also, being able to develop high-performance code can make us more independent as professionals, something that has numerous advantages, especially when dealing with large scale projects. Finally, benchmarking allows us to assess the methods we use (e.g. our heuristics) and thereby make better decisions regarding them.
In Julia, in particular, there is a useful package for benchmarking, which I discovered recently through a fellow Julia user. It’s called Benchmarking Tools and it has a number of useful functions you can use for accurately measuring the performance of any script (e.g. the @btime and @benchmark macros which provide essential performance statistics). With these measures as a guide, you can easily improve the performance of a Julia script, making it more scalable. Give it a try when you get the chance.
Note that benchmarking may not be a sufficient condition for improving a script, by the way. Unless you take action to change the script, perhaps even rewrite it using a different algorithm, benchmarking can't do much. After all, the latter is more like an objective function that you try to optimize. How it changes is really up to you! This illustrates that benchmarking is really just one part of the whole editing process.
What’s more, note that benchmarking needs to be done on scripts that are free of bugs. Otherwise, it wouldn’t be possible to assess the performance of the script since it wouldn’t run to its completion. Still, you can evaluate parts of it independently, something that a functional approach to the program would enable.
Finally, it’s always good to remember this powerful methodology for script optimization. Its value in data science is beyond doubt, plus it can make programming more enjoyable. After all, for those who can appreciate elegance in a script, a piece of code can be a work of art, one that is truly valuable.
A functional language is a programming language that is based on the functional paradigm of coding, whereby every process of a program is a function. This allows for greater speed and mitigates the risk of bugs since it's much easier to figure out what's happening in a program as everything in it is modular. In the case of such a program, each module corresponds to a function, having its own variable space. Naturally, this helps conserve memory and make any methods developed this way more scalable.
Functional languages are very important nowadays as people are realizing that their advantages make them ideal in many performance-critical cases. Also, in cases where development speed is a factor, functional languages are preferred. It's important to remember though that many people still favor object-oriented programming (OOP) languages so the latter aren't going to go away any time soon. That's why there are lots of hybrid languages that combine elements of OOP and functional programming.
So far there have been a couple of functional languages that are relevant in data science projects. Namely, there is Scala (where Spark was developed on) and Julia, with the latter gaining popularity as more and more data science packages become available in it. Interestingly, ever since these languages have been shown to provide a performance edge (just like any other functional language), their value in data science has been undeniable, even if many data scientists prefer to use more traditional languages, such as Python.
What about the future of functional programming? Well, it seems quite promising, especially considering how many new programming languages of this paradigm exist nowadays. Also, the fact that there are new ones coming about goes to show that this way of programming is here to stay. Also, since the OOP paradigm has its advantages, it seems quite likely that newer functional languages are bound to be hybrid, to lure more practitioners who are already accustomed (and to some extent vested) in the OOP way of programming. Moreover, functional languages are bound to become more specialized since there are enough of them now to need a niche in order to stand out. In fact, some of them, as for example Julia, appear to have done just that.
If you wish to learn more about the Julia functional language and its application on data science, I have authored two books about it through the Technics Publications publishing house. Feel free to check them out here and learn more about this fascinating functional language. Cheers!
What is an API?
In Computer Science, an API is short for Application Programming Interface. This is in essence a facilitator for an organization (e.g. a company) to share information with its clients and partners over the internet, oftentimes bypassing websites. And API is designed for computer programs so it’s usually developers that deal with this tech, though many data scientists and business people are getting involved in this promising piece of technology.
Why are APIs important?
APIs make prototyping a service super-fast, while they enable easier and more scalable leveraging of data. The latter can come from all sorts of sources and systems since APIs are platform-agnostic. So, if you were to create a mobile app that employs geo-location data, along with various security processes (e.g. for user authentication), you can do this easily using APIs. Also, if you have a website already for handling this sort of information exchanges, you can use an API for your target audience to interact with your online system, without even having to go to your site (the API becomes a proxy for the back-end of your site enabling them to access it through the app). For these and other reasons APIs are very important today and an essential part of any data-driven organization.
Thoughts on the "API Success" book
So, what about the "API success" book by Nelson Petracek (Technics Publications)? Well, this book covers the topic from various angles, with a strong focus on the business side of it. It provides lots of examples justifying the value-add of APIs and where they fit in in a modern organization. The book is well-written and easy to read, despite the large number of acronyms used in it. Interestingly, the book covers marketing as well, making a strong case for using APIs in a business project, be it as the main product or part of a package. It even explores how APIs can facilitate partnerships with other organizations and the fostering of long-term business relationships. The author, who is a very hands-on person, has a good sense of humor and writes in a way that's engaging and easy to follow.
The strongest part of the book, in my view, is the various architectural and design-related tips and lots of advice on the life cycle of an API, along with the corresponding diagrams that make this quite comprehensible. As for shortcomings, the lack of any hands-on material or reference resources is the only one that stands out. Nevertheless, the rest of the book makes up for this, through comprehensive coverage of the topic from various angles.
How you can get this book at a 20% discount
Although this book is available in a variety of places, you can get it at a discounted price if you go to the publisher’s site and use the coupon code DSML at the checkout. The book is already reasonably priced (around $30 for the printed version) but why not get it at a lower price? After all, this is a book with evergreen content, something you’d like to refer to again and again, maybe even share with your team when building your own APIs. Check it out!
Highlights of JuliaCon 2020
JuliaCon stands for Julia Conference and it’s an annual educational/promotional event that Julia Computing organizes. The latter is the Massachusetts based company that manages the development and evolution of the programming language. So, JuliaCon is its way of promoting the language and keeping everyone interested in it updated on its recent developments.
JuliaCon is primarily for programmers and members of the scientific community employing Julia in their work. However, it also appeals to Julia enthusiasts and anyone interested in the ecosystem of the language, as well as its numerous applications. It’s not targeted at data scientists per se though lately there are several sessions in the conference that involve Machine Learning and A.I. since lots of people are interested in these areas. Note that most of the people involved in these packages are not professional data scientists, though some of them are familiar with the field and have written papers about it (mostly academic papers). So, if you are looking to learn about data science in this conference you may be disappointed, yet if you just wish to explore what Julia brings to the table when it comes to data science tools, you may be in for a treat.
This year several interesting things were revealed in the JuliaCon, which I attended. Namely, the Tuesday workshop on improving Julia code performance and compatibility with other programming languages was truly worth it as it covered a variety of tweaks that can make a script use less memory and/or work faster. Also, being able to incorporate Python and C code in a Julia script was something that was covered thoroughly, more than any documentation page could.
Unfortunately, some sessions weren’t properly synced and were either delayed or altogether missing from the live stream (at least on my Firefox browser). This definitely took away from the whole conference experience. Perhaps if the whole conference was done on Zoom, it would have been a smoother experience. The Q&A chat in the workshops was a nice touch though and added a lot to them.
The sessions themselves were pretty good overall, covering a variety of topics, from the more technical to the more application-oriented. They were organized in different tracks, making it easy to find the session you were most interested in. The Interactive data dashboards with Julia and Stipple session stood out. Even though it was a fairly short one, it was very relevant to data science work and with good examples, showcasing its functionality. I’d definitely recommend you watch the recording of it, which should be available by now at the Julia YouTube channel, along with the other sessions of the conference.
JuliaCon usually takes place in either the US or Europe. This year it was Europe’s turn to host the conference and it was scheduled to take place in Lisbon, Portugal. Although that laid-back Mediterranean capital would be ideal for such a conference (definitely more accessible than London, where it took place a couple of years ago), this year for the first time it took place online. This was due to the safety measures related to Covid-19 that impacted logistics severely. Anyway, if all goes well, it's expected that next year it will take place in the US. If you wish to delve more into Julia feel free to check out my books on the subject. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.