Benchmarking Data Science Scripts
Benchmarking is the process of measuring a script's performance in terms of the time it takes to run and the memory it requires. It is an essential part of programming and is particularly useful for developing scalable code. It usually involves a more detailed analysis of the code, such as profiling, so that we know exactly which parts of the script run most often and what proportion of the overall time they take. With this information, we know where to optimize the script, making it lighter and more useful.
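To make the two measurements concrete, here is a minimal sketch in Python using only the standard library. The measure helper and the squares_sum workload are hypothetical names made up for illustration. Timing and memory tracing are done in separate runs because tracing slows the code down, so treat the numbers as indicative rather than exact.

```python
import time
import tracemalloc


def measure(func, *args, repeats=5):
    """Return (result, best_seconds, peak_bytes) for func(*args)."""
    # Timing: keep the best of several untraced runs to reduce noise.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        timings.append(time.perf_counter() - start)

    # Memory: one extra run with allocation tracing switched on.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return result, min(timings), peak


def squares_sum(n):
    # Hypothetical workload: builds a list, so it costs memory as well as time.
    values = [i * i for i in range(n)]
    return sum(values)


result, elapsed, peak = measure(squares_sum, 100_000)
print(f"result={result}, best time={elapsed:.4f}s, peak memory={peak / 1024:.1f} KiB")
```

Swapping the list for a generator expression inside squares_sum and measuring again is a quick way to see how a small change shows up in the peak memory figure.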
Benchmarking is great as it allows us to optimize our scripts, but what does this mean for us as data scientists? From a practical perspective, it enables us to work with larger data samples and save time. We can use this extra time for more high-level thinking and for refining our work. Also, being able to develop high-performance code can make us more independent as professionals, something that has numerous advantages, especially when dealing with large-scale projects. Finally, benchmarking allows us to assess the methods we use (e.g. our heuristics) and thereby make better decisions regarding them.
In Julia, in particular, there is a useful package for benchmarking, which I discovered recently through a fellow Julia user. It's called BenchmarkTools and it has a number of useful macros for accurately measuring the performance of any piece of code (e.g. the @btime and @benchmark macros, which provide essential performance statistics). With these measures as a guide, you can more easily improve the performance of a Julia script, making it more scalable. Give it a try when you get the chance.
Note that benchmarking alone is not sufficient for improving a script, by the way. Unless you take action to change the script, perhaps even rewriting it using a different algorithm, benchmarking can't do much. After all, a benchmark is more like an objective function that you try to optimize. How it changes is really up to you! This illustrates that benchmarking is really just one part of the whole editing process.
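As a rough illustration of treating the benchmark as an objective function, the sketch below times two versions of the same task in Python with the standard timeit module. Both functions and the sample data are hypothetical, chosen only to show a rewrite with a different algorithm. The benchmark itself changes nothing; it just tells us whether the rewrite paid off.

```python
import timeit


def count_common_naive(a, b):
    # Scans the whole list b for every element of a.
    return sum(1 for x in a if x in b)


def count_common_set(a, b):
    # Same result, different algorithm: build a set once, then do fast lookups.
    b_set = set(b)
    return sum(1 for x in a if x in b_set)


a = list(range(0, 2000, 2))  # even numbers below 2000
b = list(range(0, 2000, 3))  # multiples of 3 below 2000

# A rewrite only counts as an improvement if it still gives the same answer.
common = count_common_naive(a, b)
assert common == count_common_set(a, b)

# Best of three repeats, five calls each.
t_naive = min(timeit.repeat(lambda: count_common_naive(a, b), number=5, repeat=3))
t_set = min(timeit.repeat(lambda: count_common_set(a, b), number=5, repeat=3))

print(f"common elements: {common}")
print(f"naive: {t_naive:.4f}s, set-based: {t_set:.4f}s")
```

Checking that both versions agree before timing them keeps the comparison honest.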
What’s more, note that benchmarking needs to be done on scripts that are free of bugs. Otherwise, it wouldn’t be possible to assess the performance of the script since it wouldn’t run to its completion. Still, you can evaluate parts of it independently, something that a functional approach to the program would enable.
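Here is a small sketch of what evaluating parts independently might look like in Python when the script is organized as functions. The three stages (make_data, normalize, mean) are hypothetical stand-ins for the steps of a real pipeline. Each one is timed on its own, so a problem in one stage doesn't prevent us from measuring the others.

```python
import timeit


def make_data(n):
    # Hypothetical stage 1: generate deterministic pseudo-data in 0..100.
    return [(i * 37) % 101 for i in range(n)]


def normalize(values):
    # Hypothetical stage 2: rescale the values to the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean(values):
    # Hypothetical stage 3: a simple summary statistic.
    return sum(values) / len(values)


# Prepare each stage's input once, so every stage is timed in isolation.
data = make_data(50_000)
scaled = normalize(data)

stage_times = {
    "make_data": min(timeit.repeat(lambda: make_data(50_000), number=3, repeat=3)),
    "normalize": min(timeit.repeat(lambda: normalize(data), number=3, repeat=3)),
    "mean": min(timeit.repeat(lambda: mean(scaled), number=3, repeat=3)),
}

# List the slowest stage first, since that is where optimization pays off most.
for name, seconds in sorted(stage_times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {seconds:.5f}s")
```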
Finally, it’s always good to remember this powerful methodology for script optimization. Its value in data science is beyond doubt, plus it can make programming more enjoyable. After all, for those who can appreciate elegance in a script, a piece of code can be a work of art, one that is truly valuable.
Cloud computing is the use of external computing resources for storage and computing tasks via the internet. As for the cloud itself, it's a collection of servers that are dedicated to this task and made available, usually through some paid licensing, to anyone with an internet connection. Naturally, these servers are scalable, so you can always increase the storage and computing power you lease from a cloud provider. Although the cloud was initially used for storing data, e.g., for back-up purposes, it's now used for various tasks, including data science work.
There are various kinds of machines used for cloud computing, depending on the tasks outsourced to them. For starters, there are the conventional (CPU) servers, used for storage and light computation. Most websites use this cloud computing option, and it's the cheapest one compared with the more specialized servers. For small-scale data science projects, especially those employing basic data models, these servers work well.
Additionally, there are GPU servers, which are more affordable for the computational resources they provide. Although GPUs are geared towards graphics-related work (e.g., the rendering of a video), they are well-suited for AI-based models. The latter require a large number of computations during their training phase. As more and more data becomes available, this computational cost can only increase. So, having a scalable cloud computing solution that uses this type of server is the optimal strategy for deploying such a model.
Finally, there are also high-memory servers, which are like the regular servers but with plenty of extra RAM. Such servers are ideal for any use case where lots of data is involved and needs to be processed in large chunks. Many data science models fall into this application category since RAM is a valuable resource when large datasets are involved. Multimedia data, in particular, requires lots of memory to be processed at a reasonable speed, even for models that don't need any training.
Cloud computing has started to dominate the data science world lately. This phenomenon is partly due to the use of all-in-one frameworks, which take care of various tasks. These frameworks usually run on the cloud since they require many resources due to the models they build and train. As a result, unless there is a reason for data science work to be undertaken in-house, it is usually outsourced to the cloud. After all, most cloud computing providers ensure high-level encryption throughout the pipeline. Such security measures mitigate the risk of data leaks or the compromise of the personally identifiable information (PII) that often exists in datasets these days.
A great place that offers cloud computing options for data science is Hostkey. Apart from the conventional servers most hosting companies offer, this one provides GPU servers too. What's more, everything comes at a very affordable price, making this an ideal solution for the medium and the long term. Check out the company's website for more information. Cheers!
A functional language is a programming language that is based on the functional paradigm of coding, whereby every process of a program is a function. This allows for greater speed and mitigates the risk of bugs since it's much easier to figure out what's happening in a program as everything in it is modular. In the case of such a program, each module corresponds to a function, having its own variable space. Naturally, this helps conserve memory and make any methods developed this way more scalable.
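As a small sketch of this idea, here is a toy example written in a functional style in Python (used here only because it is widely known; it is not a functional language itself). The function names and the sample data are made up for illustration. Each function has its own local variables and returns a new value instead of changing its input, which is what makes each piece easy to test and reason about on its own.

```python
from functools import reduce


def clean(values):
    # Returns a new list without the missing entries; the input is untouched.
    return [v for v in values if v is not None]


def scale(values, factor):
    # Returns a new list with every value multiplied by factor.
    return [v * factor for v in values]


def total(values):
    # Folds the list into a single number, starting from 0.
    return reduce(lambda acc, v: acc + v, values, 0)


raw = [1, None, 2, 3, None]

# The whole "program" is just a composition of functions.
result = total(scale(clean(raw), 10))

print(result)  # 60
print(raw)     # the original data is unchanged
```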
Functional languages are very important nowadays as people are realizing that their advantages make them ideal in many performance-critical cases. Also, in cases where development speed is a factor, functional languages are preferred. It's important to remember though that many people still favor object-oriented programming (OOP) languages so the latter aren't going to go away any time soon. That's why there are lots of hybrid languages that combine elements of OOP and functional programming.
So far, a couple of functional languages have been relevant in data science projects: Scala (in which Spark was developed) and Julia, with the latter gaining popularity as more and more data science packages become available in it. Interestingly, ever since these languages were shown to provide a performance edge (just like any other functional language), their value in data science has been undeniable, even if many data scientists prefer to use more traditional languages, such as Python.
What about the future of functional programming? Well, it seems quite promising, especially considering how many new programming languages of this paradigm exist nowadays. The fact that new ones keep coming about goes to show that this way of programming is here to stay. Since the OOP paradigm has its advantages too, it seems quite likely that newer functional languages will be hybrid, to lure more practitioners who are already accustomed to (and to some extent invested in) the OOP way of programming. Moreover, functional languages are bound to become more specialized since there are enough of them now that each needs a niche in order to stand out. In fact, some of them, such as Julia, appear to have done just that.
If you wish to learn more about the Julia functional language and its application in data science, I have authored two books about it through the Technics Publications publishing house. Feel free to check them out and learn more about this fascinating functional language. Cheers!
The world of data professionals is sophisticated and diverse, especially nowadays. It involves professionals whose expertise ranges from the design of data flows to databases, data analytics models, machine learning systems, and APIs that connect the users to a cloud-based solution. It's not a simple matter, and the variety and depth of all these roles leave people bewildered and uncertain about what this ecosystem is and what it can do for an organization.
We can attempt to gain an understanding of this world by reviewing the various professionals found in it. First of all, we have the data architects (aka data modelers), who are responsible for designing data/information flows, facilitating communication among the people in an organization, and developing the infrastructure for all movement and storage of the organization's data. They are often involved in database solutions as well as ETL processes and the creation of glossaries. Data architects are essential in an organization, especially when there is plenty of data involved or when the data plays a vital role in the organization's workflow. Most modern organizations are like that, and the abundance of data makes these professionals necessary.
Beyond this role, there are also data analytics professionals, particularly data scientists. These professionals derive value from the available data, usually by discovering insights. Data scientists are more geared towards messy (e.g., unstructured or highly noisy) data and more advanced models. All data analytics professionals work with databases through focused queries, while the creation of visuals based on the data is an essential part of their pipeline. Naturally, this role involves some programming (more so in the case of data scientists) and communication with each project's stakeholders. The creation of dashboards is a typical deliverable in this role, though other kinds of data products are sometimes developed instead.
Data engineers are also essential professionals in this ecosystem. This role entails data governance, particularly when big data is involved, as well as various ETL processes that facilitate data analytics work. Managing containers in the cloud and specialized software like Spark is part of these professionals' job descriptions. Data engineers are heavy on programming and often deal with computer clusters, be they physical or virtual. Their communication with the project stakeholders is relatively limited, although they liaise with data scientists quite a bit. Some data engineers are well-versed in data science methods, particularly the development and deployment of predictive models.
Finally, business intelligence (BI) folks also have a role to play in the data world. This role involves liaising with the managers and other project stakeholders. BI professionals tend to be more knowledgeable regarding the inner workings of an organization. At the same time, their use of data is limited to basic models, useful graphics, and descriptions of the problem at hand. BI professionals are most closely related to data analysts, though they tend to be more involved in high-level tasks. Also, their use of programming is minimal.
If you want to learn more about the data professionals' world, I invite you to check out some great books, like those available at the Technics Publications site. Although geared more towards data modeling, this publisher covers the subject quite well, providing practical knowledge from various professionals in the fields mentioned earlier. If you use the coupon code DSML, you can get a 20% discount on any books purchased. Check it out when you have the chance. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.