Programming, particularly in languages like Julia, Python, and Scala, is fundamental to data science. It enables all kinds of processes, such as data engineering (particularly ETL tasks), data modeling, and even data visualization, to name a few. If you know what you are doing, you can also solve practical problems through programming (e.g., optimization tasks) by modeling them appropriately. It's a versatile tool with a lot of potential, especially once you get used to it and see it as an extension of your mind. However, it's not as simple as putting bits and pieces of code together. This article examines different coding strategies and the objectives each one serves.
Let's start with the most intuitive kind of objective, namely getting things done as quickly as possible. This strategy suits problems that need solving only once, so the code won't have to be revised or reused. It involves writing code that works well enough to prove a particular concept before tackling the task more seriously. Efficiency isn't a priority, and neither is readability or extensive commenting to explain what's happening. This strategy is typical for solving a drill or a relatively simple problem that you don't need to present to the project stakeholders. Because it sacrifices both efficiency and readability, using it in any other scenario is a terrible idea.
Another objective we may have when writing a script is efficiency. When we process lots of data, we don't want sluggish code that takes a long time to finish the task at hand. Optimizing the code through smart memory allocation, static typing, appropriate variable types, and so on can help with that. This is quite a common programming strategy, and it can save us a lot of time. It's also useful when deploying the code at scale, since it means using fewer computational resources (CPU/GPU power and memory), which lowers the cost of the project.
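To make the point about variable types and memory a bit more concrete, here is a minimal sketch in Python using only the standard library. The function names and the toy task (squaring a range of numbers) are made up for illustration; the idea is simply that a typed, compact container or a streaming computation can do the same job as a plain list while holding far less in memory.

```python
import sys
from array import array


def squares_list(n):
    """Naive version: grows a list of boxed Python floats one at a time."""
    result = []
    for i in range(n):
        result.append(float(i) ** 2)
    return result


def squares_typed(n):
    """Typed version: a compact array of C doubles built from a generator."""
    return array("d", (float(i) ** 2 for i in range(n)))


def sum_of_squares_streaming(n):
    """Streaming version: no intermediate collection is kept at all."""
    return sum(float(i) ** 2 for i in range(n))


n = 100_000
as_list = squares_list(n)
as_array = squares_typed(n)

# Rough memory footprint: the list stores pointers plus one float object
# per element, while the array stores the raw doubles directly.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)

print(f"list:  {list_bytes} bytes")
print(f"array: {array_bytes} bytes")
print("same total:", sum(as_list) == sum(as_array) == sum_of_squares_streaming(n))
```

The exact byte counts depend on the Python build, so treat them as indicative only. In real projects the same idea usually shows up as choosing suitable numeric types in a numerical library, or as type annotations in a language like Julia.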
Interpretability and maintainability form a different objective altogether, tied to the final programming strategy. If you want your program to be easy to read and understand, and therefore easy to update when necessary, you opt for this strategy. It involves organizing your code so that the problem is broken down into simple tasks handled by different classes and functions, and including plenty of comments that explain your reasoning and what each function does. Naming the variables intuitively is also a big plus, even if that makes the code longer at times. In any case, such code is built to last since it's easy to maintain, and it helps newcomers who read it adopt good practices when writing their own code.
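As a rough illustration of this strategy, the sketch below splits a tiny data-cleaning task into small, well-named Python functions, each with a short explanation of what it does. The input format ("name, value" text lines), the sensor names, and all function names are hypothetical and chosen only for the example.

```python
from statistics import mean


def parse_record(line):
    """Split a 'name, value' line into a (name, raw_value) pair of strings."""
    name, raw_value = line.split(",", 1)
    return name.strip(), raw_value.strip()


def is_valid_measurement(raw_value):
    """Return True if the raw text can be read as a number."""
    try:
        float(raw_value)
        return True
    except ValueError:
        return False


def extract_measurements(lines):
    """Group the numeric values by sensor name, skipping unreadable ones."""
    measurements = {}
    for line in lines:
        sensor_name, raw_value = parse_record(line)
        if is_valid_measurement(raw_value):
            measurements.setdefault(sensor_name, []).append(float(raw_value))
    return measurements


def average_per_sensor(measurements):
    """Compute the mean value for each sensor."""
    return {name: mean(values) for name, values in measurements.items()}


# Made-up input: the last record has a missing value and is skipped.
raw_lines = ["sensor_a, 1.5", "sensor_b, 4.0", "sensor_a, 2.5", "sensor_b, n/a"]
averages = average_per_sensor(extract_measurements(raw_lines))
print(averages)
```

The code is longer than a one-liner that does the same thing, but each function can be read, tested, and replaced on its own, which is the trade-off this strategy accepts.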
Naturally, you can combine the above strategies in your project. Not all of them play ball together, of course, but you can still write a script that's both efficient and easy to understand and maintain. So, unless you are pressed for time, it's good to take such an approach to your programming, adapting it to each project's requirements.
If you wish to learn more about programming and how it applies to data science, you can check out one of my latest books, Julia for Machine Learning. This book explores how the Julia programming language can be used to tackle various data science problems, using machine learning models and heuristics. Accompanied by a series of examples in Jupyter notebooks and script files, it illustrates in easy-to-follow code how you can implement this framework in your data science work. So, check it out when you have a moment. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.