Code notebooks have become a necessity for anyone doing serious work in this field. Although they don't offer the same functionality as IDEs, they are instrumental, especially if you want to showcase your code. They have also evolved quite a bit over the years. The notebook covered in this article seems to be the latest step in this evolution, at least for data science work in the Julia programming language.
Don’t You Mean Jupyter Notebooks?
I understand why someone would think that. After all, Jupyter notebooks are well-established, and I've often used them myself over the years. However, the code notebook I'm writing about here is an entirely different animal (you could say another world altogether!). Neptune notebooks have little to do with Jupyter ones, so if you go to them expecting them to be just like Jupyter but better, you'll be disappointed. However, if you see them for what they are, you may be in for a pleasant surprise.
A Voyage through the Solar System of Code Notebooks
Neptune notebooks are essentially Julia scripts rendered in a web browser. At the time of this writing, that browser is usually Chrome or some variant of it (e.g., Chromium), or Firefox. The latter, however, isn't ideal for particular tasks, such as printing (even printing to PDF).
A Neptune notebook can run on Julia even if you don't have the Neptune.jl library installed. If you do have it, though, you can load the notebook in your browser and have Julia running in the background, just as with Jupyter notebooks. Unlike its more established sibling, however, the Neptune notebook only supports Julia, particularly the later versions of the language. Also, the layout is quite different, and at first it may seem off-putting if you are used to the elegance and refined interface of Jupyter notebooks.
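If you'd like to try it, getting a notebook up and running takes only a couple of commands. Here is a minimal sketch, assuming the package is registered as Neptune.jl and follows the same run() convention as its Pluto.jl parent:

```julia
# One-off installation of the Neptune package, then launching the notebook server;
# assuming Neptune mirrors Pluto's interface, this opens a tab in your default browser
using Pkg
Pkg.add("Neptune")

using Neptune
Neptune.run()
```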
Neptune notebooks are rudimentary, perhaps even minimalist, compared to Jupyter ones. Because of this, however, they are far more stable and efficient. In fact, in the few months I've been using Neptune notebooks, I've never had one crash on me, not once. Also, authentication errors are rare and only happen if you try to run a Neptune notebook on both Firefox and Chromium simultaneously; the notebook seems to lock onto the browser you start it in, usually the default one. Wrinkles like this will hopefully be ironed out in future versions of the Neptune library. Despite them, these notebooks are quite slick, and their support for Markdown is a noteworthy alternative to the text cells of Jupyter notebooks. Overall, they are perhaps geared towards more advanced users at the time of this writing. Hopefully, this will change as more data scientists embrace this technology.
By the way, Neptune is technically a fork of the Pluto.jl library, which enables Pluto notebooks. The latter, although quite interesting, aren't designed for data science work, so I wouldn't recommend them for it. If, however, you are a Julia programmer who wants to try something different, Pluto is a good alternative. Just don't attempt a data science project on it before getting some insurance policy on your computer, since you are likely to physically break it out of frustration!
The development of code notebooks is fascinating, and Neptune seems to be a respectable addition to this evolution. As there aren't any decent tutorials at the moment that I can point you to, I suggest you play around with such a notebook yourself to see if it works for you. If you want to see one in action, you can check out the Anonymization and Pseudonymization course I published relatively recently on WintellectNow. All the coding in it lives in a Neptune notebook, which forms the hands-on part of that mini-course. Cheers.
This topic may seem a bit strange, but I'm running out of ideas here! Still, it's interesting how often it comes up in mentoring sessions, especially when dealing with A/B testing. So, if you can't answer the question "when are two numbers equal enough?" in a simple sentence, perhaps you'll have something to learn from this article.
First of all, the rationale behind all this. Sometimes, we need to make an executive decision about whether we should apply this or that function to the data at hand. In A/B testing, this is usually something like "should we go for the equal-variances or the unequal-variances variant of the T-test?" Of course, when you have two samples, the chances of their variances being exactly equal are minuscule, so why did those old sages of Stats whom we revere so much decide to have two variants of the T-test, based on the equality of the variances involved? Well, a different formula is used in each case because, if the variances are the same, the underlying math is much simpler. But then the question becomes "when are these two variances equal?" Keep in mind that we are talking Stats here, so the rigidity of Math as we know it doesn't apply. We are comfortable with approximations; otherwise, we'd have to abandon the whole idea of Statistics altogether!
In engineering, two numbers are equal when their difference is within a tolerance margin. We usually express this tolerance as a threshold th given as a negative power of ten. So, we often have something like th = 10^(-3), which is a fancy way of saying th = 0.001. This kind of approximation, although very handy, may not apply to the problem at hand. Besides, few disciplines have the scientific reasoning and discipline that Engineering exhibits, and Stats is not one of them. Also, let's not forget that traditional Computer Science is akin to Engineering, so the approx() function found in many languages follows a similar motif, making it inapplicable to the problem mentioned previously.
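To make the engineering convention concrete, here is what such a check looks like in Julia, using the built-in isapprox function with an absolute tolerance:

```julia
# Engineering-style equality: two numbers are "equal" if their difference is within th
th = 1e-3                               # i.e., 10^(-3) = 0.001
isapprox(0.2500, 0.2504, atol = th)     # true: the difference (0.0004) is within the tolerance
isapprox(0.2500, 0.2520, atol = th)     # false: the difference (0.0020) exceeds it
```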
In Physics, things are a bit different, which is why we often talk about orders of magnitude. So, it's often the case that if two quantities A and B differ by at least an order of magnitude, they are considered very different. This is another way of saying that one is at least ten times bigger than the other. This is something we can apply to our problem, since it gives us a relative rule of thumb to work with. Of course, an order of magnitude is quite a lot when we talk about variances, but we can adapt this to something that makes more sense in Analytics work.
So, what about a fixed percentage, maybe one order of magnitude less than 1? This would translate into 10% (since 1 = 100%), something that's not too much but not negligible either. So, if v1 and v2 are the two variances at hand, we can say that if v1 <= (1 + 10%)*v2 and v2 <= (1 + 10%)*v1, we can presume v1 and v2 to be more or less equal. Note that this wouldn't work if one of them is 0, in which case the two variances would always be considered different from each other. Then again, this makes intuitive sense, since we'd be dealing with a constant variable and one that varies at least a bit. Also, since things are simpler if we use the smaller variance as the reference point, we can do a single comparison and be done with it. After all, if v2 is the smaller one and v1 <= 1.1*v2, we can be sure that the reverse inequality would also hold.
In other words, we can use a script like the one attached to this article and not have to worry about this matter much (note that this script allows us to use a different threshold too, other than 0.1). Cheers!
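For reference, here is a minimal sketch of what such a check could look like in Julia; the function name and defaults are my own for illustration, not necessarily those of the attached script:

```julia
# "Equal enough" check for two variances: the larger one must be within
# th (10% by default) of the smaller one, covering both comparisons at once
function roughly_equal(v1::Real, v2::Real; th::Real = 0.1)
    lo, hi = minmax(v1, v2)     # the smaller and the larger of the two variances
    return hi <= (1 + th) * lo  # if exactly one of them is 0, this is always false
end

roughly_equal(2.10, 2.25)            # true (within 10% of each other)
roughly_equal(1.00, 1.50)            # false
roughly_equal(1.00, 1.50, th = 0.6)  # true (with a looser threshold)
```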
The latter has been something I've been looking into for a while now. However, my skill set wasn't up to it until recently, when I started working with GUIs for shell scripting. So, if you have a Linux-based OS, you can now use a GUI for a couple of methods in the Thunderstorm system. Well, assuming I release the code for it someday.
Alright, enough with the drama. This blog isn't FB or some other overly sensational platform. However, if you've been following my work since the old days, you may be aware that I've developed a nifty cipher called Thunderstorm. But that's been around for years, right? Well, yes, but now it's becoming even more intriguing. Let's see how, and why this may be relevant to someone in a data-related discipline like ours.
First of all, the code base of Thunderstorm has been refactored significantly since the last time I wrote about it. These days, it features ten script files, some of which are relevant to data science work too (e.g., ectropy_lite.jl) or even simulation experiments (e.g., random.jl, the script, not the package!). One of the newest additions to this project is a simple key-stream generation method (keygen) based on a password. Although this isn't true randomness, it's relatively robust, in the sense that no repeating patterns have emerged in any of the experiments on the files it produced, some of which were several MB in size. So, even though these keys are not as strong as something made using true randomness (a TRNG method), they are still random enough for cryptographic tasks.
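To give a rough idea of what password-based key-stream generation means in general, here is a toy illustration of the concept in Julia, using a simple SHA-256 hash chain. To be clear, this is not Thunderstorm's actual keygen, and it isn't vetted for real cryptographic use:

```julia
# Toy password-seeded key stream via a SHA-256 hash chain; purely illustrative,
# NOT the Thunderstorm keygen and NOT suitable for production cryptography
using SHA  # SHA ships with Julia as a standard library

function toy_keystream(password::String, nbytes::Int)
    out = UInt8[]
    block = sha256(password)
    while length(out) < nbytes
        append!(out, block)
        block = sha256(block)   # chain the hash to extend the stream
    end
    return out[1:nbytes]
end

key = toy_keystream("my secret passphrase", 64)  # 64 pseudo-random key bytes
```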
What's super interesting (at least to me and maybe some open-minded cryptographers) is a new method I put together that allows you to refresh a given key file. Naturally, that key file would ideally be one employing true randomness, but the function works for any file. This script, which I imaginatively named keys.jl, is one I've developed a GUI for too.
Although I doubt I'll make Thunderstorm open-source in the foreseeable future (partly because most people are still not aware of its value-add in the quantum era we are in), I plan to keep working on it, and maybe even build more GUIs for its various methods. The benchmarking I did a couple of months back was very promising for all of its variants (yes, there are variants of the cipher method now), so that's nice.
In any case, it's good to protect your data files in whatever way you can, and what better way to do this than with a cipher, especially if PII is involved? The need for protecting sensitive data increases further if you need to share it across insecure channels, like most web-based platforms. Also, even if something is encrypted, lots of metadata can spill over, since the encrypted file's size is generally the same as that of the original file. Well, that's not the case with the original version of Thunderstorm, which tinkers with that aspect of the data too. So, even metadata mining isn't all that useful when a data file is encrypted with the Thunderstorm cipher.
I could write about this topic until the cows come home, so I’ll stop now. Stay tuned for more updates on this cryptographic system (aka cryptosystem) geared towards confidentiality. In the meantime, feel free to check out my Cybersecurity-related material on WintellectNow, for more background information on this subject. Cheers!
What Rust is
I may have mentioned Rust in the past, but now I'd like to talk more about it and its role in data science and A.I., as it has passed the test of time, in my view. Having delved into Rust programming enough to understand that it's much more challenging than I first realized, I believe I can now write about it with confidence. Also, since it's no longer new to me, I'm well past the infatuation stage that characterizes most people who talk or write about it, usually shortly after they've started exploring it.
So, Rust is a high-performance language, currently in version 1.51, with a large enough community of users (and companies) to make a dent in the programming realm. There is even a Rust track on the Exercism platform, with dedicated mentors who can help you learn the language through carefully designed and curated programming drills. What's more, there are a few interesting books on Rust, as well as conferences and workshops for anyone serious about the language.
Rust’s key strengths
Rust isn't popular because of its particular name or its cool logo, though. It earned its popularity through the strengths it brings to the table and the value-adds that accompany its deployment. First of all, it's high-performance, meaning that you can use it instead of C, C++, or even Java. That's not easy to accomplish, and few languages have managed it. Also, it offers this performance while maintaining a relatively high-level approach to programming, much like most modern languages.
Additionally, Rust is reliable and as safe as it gets. Many consider it better in that respect than even C, which has a series of memory management issues that result in risky code. So, if you want to build a program that just works and won't make you sleep with your phone on at night (in case you need to fix an issue in a script you've shipped), Rust is a good option.
Finally, Rust is geared towards productivity. It's not an academic language or something a bunch of hobbyists put together, far from it. Rust is built for devs and people who are dead serious about designing and deploying software. The language's well-written documentation adds to this. At the same time, its error messages, although frustrating at first, give you actual insight into what's wrong with your scripts (instead of some generic error message that's more of a puzzle than any real help for debugging your code).
Rust and Data Science
When it comes to data science work, particularly machine learning and AI-related tasks, Rust has the potential to be a great asset. I say this even though I'm vested in another high-performance language, Julia, about which I've written extensively (my books on Julia) and which I continue to use to this day. However, unlike the fanboys of this or that data science language, I'm open to new possibilities, which I'm always eager to explore. So, even though I'm a long way from being a Rust veteran, I can see its merit in our field.
So far, there are a few Rust packages for ML work, such as Smartcore and Linfa (Italian for "sap"), though, in all fairness, this codebase is nowhere near the variety and maturity of the likes of Scikit-learn in Python or the packages in the Julia ecosystem. Still, there is a lot of value Rust offers in this space, and as the community grows, we should expect to see Rust's ML and A.I. libraries grow in both number and sophistication.
It may seem a bit too early to tell, but it's not far-fetched to say that Rust is here to stay. While high-level languages like Python had little more to offer than simplicity and ease of use (probably the main reason they made it into the data science world), Rust is closer to modern languages like Julia and Nim, which offer a serious performance boost. Its business proposition is unquestionable, its adoption higher than many people expected, and its potential to make a dent in machine learning hard to contest. Once you get past its eccentric programming style, you may begin to view it with the respect and fondness it deserves. So, check it out when you have a moment. Cheers!
Programming, particularly in languages like Julia, Python, and Scala, is fundamental in data science. It enables all kinds of processes, such as data engineering (particularly ETL tasks), data modeling, and even data visualization, to name a few. If you know what you are doing, you can also solve practical problems through programming (e.g., optimization tasks) by modeling them appropriately. It's a versatile tool with a lot of potential, especially once you get used to it and see it as an extension of your mind. However, it's not as simple as putting bits and pieces of code together. This article examines the various coding strategies available, in relation to the objectives we may have.
Let's start with the most intuitive kind of objective, namely getting things done as quickly as possible. This strategy may be suitable for problems that need to be solved only once, so the code doesn't need to be revised or reused. It typically involves writing code that works well enough to prove a particular concept before the task is tackled more seriously. Efficiency isn't pursued, nor are readability and plentiful comments explaining what's happening. This strategy is typical for solving a drill or a relatively simple problem that you don't need to present to the project stakeholders. As a result, using it for other scenarios is a terrible idea.
Another objective we may have when writing a script is efficiency. When we process lots of data, we don't want sluggish code that takes ages to finish the task at hand. So, optimizing the code for efficiency through smart memory allocations, static typing, appropriate variable types, etc., can help with that. This is quite a common programming strategy that can save us a lot of time. It's also useful when deploying code at scale, since it means we'll be using fewer computational resources (CPU/GPU power and memory), lowering the cost of the project at hand.
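As a small illustration of this strategy in Julia (the function is hypothetical, just to show the idea), note the concrete types and the preallocated output buffer, both of which keep allocations down in a hot loop:

```julia
# Efficiency-minded version: concrete types, a preallocated output matrix,
# and an in-place loop instead of building new arrays on every call
function scale_columns!(out::Matrix{Float64}, X::Matrix{Float64}, w::Vector{Float64})
    @inbounds for j in axes(X, 2), i in axes(X, 1)
        out[i, j] = X[i, j] * w[j]
    end
    return out
end

X = rand(1_000, 20); w = rand(20)
out = similar(X)           # allocate the buffer once and reuse it across calls
scale_columns!(out, X, w)
```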
Interpretability and maintainability form a different objective altogether, tied to the final programming strategy. If you want your program to be easy to read and understand, and therefore easy to update when necessary, you opt for this strategy. It involves organizing your code so that the problem is broken down into simple tasks handled by different classes and functions, and including lots of comments explaining your reasoning and what the various functions do. Naming the variables in an intuitive way is also a big plus, even if that makes the code longer at times. In any case, such code is built to last, since it's easy to maintain and helps newcomers who read it adopt good practices in their own code.
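And here is a tiny, hypothetical example of the readability-oriented strategy: a small, descriptively named function with a docstring and a comment, which a newcomer can grasp at a glance:

```julia
"""
    missing_ratio_per_column(data)

Return the proportion of missing values in each column of `data`
(any table-like object supporting `eachcol`).
"""
function missing_ratio_per_column(data)
    # One ratio per column: missing entries divided by the column's length
    return [count(ismissing, col) / length(col) for col in eachcol(data)]
end
```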
Naturally, you can use a combination of the above strategies in your project. Not all of them play ball together, of course, but you can still write a script that's both efficient and easy to understand and maintain. So, unless pressed for time, it's good to take such an approach to your programming, adapting it to each project's requirements.
If you wish to learn more about programming and how it applies to data science, you can check out one of my latest books, Julia for Machine Learning. This book explores how the Julia programming language can be used to tackle various data science problems, using machine learning models and heuristics. Accompanied by a series of examples in Jupyter notebooks and script files, it illustrates in quite comprehensible code how you can implement this framework for your data science work. So, check it out when you have a moment. Cheers!
Text editors are specialized programs that enable you to process text files. Although they are relatively generic, many of them focus on script files, such as those used to store Julia and Python code (pretty much every programming language uses text files to store the source code of its scripts). Modern text editors have evolved enough to pinpoint particular keywords in script files and highlight them accordingly. This highlighting enables the user to understand a script better and develop it more efficiently. Many text editors today can also help pinpoint potential bugs (programming jargon for mistakes or errors), making the whole process of refining a script easier.
In data science work and work related to A.I., text editors are immensely important. They help you organize code, develop it faster and more easily, optimize it, and review it. Data science scripts can get bulky and are often interconnected, meaning that you have to keep track of several script files. Text editors make that a more feasible and manageable task, while some provide useful shortcuts to accelerate your workflow. Additionally, some text editors integrate with the programming languages themselves, enabling you to run the code as you develop it while keeping track of your workspace and other useful information. This is what people call an IDE, short for Integrated Development Environment, something essential for any non-trivial data science project.
One of the text editors that shine when it comes to data science work is Atom. This fairly minimalist text editor can handle various programming languages, and it offers several plugins that extend its functionality. It's no wonder it is so widely used by code-oriented professionals, including data scientists. Like most text editors out there, it's cross-platform and intuitive, while also being highly customizable and easy to learn. It's useful for viewing text files too, though you may want to look into more specialized software for huge files.
Another text editor that has gained popularity recently, particularly among Julia users, is Visual Studio Code (VS Code). This editor is much like Atom but a bit easier and slicker to use. It has a smoother interface, and its integration of the terminal is seamless. The debug console it features is also a big plus, along with the other options it provides for troubleshooting your scripts. Lately, it has become the go-to editor for Julia programmers, which is interesting considering how invested the Julia community had been in the Atom editor and its Julia-centric version, Juno.
Beyond these two text editors, there are others you may want to consider. Sublime Text, for example, is noteworthy, though its full version carries a price tag. In any case, the field of text editors is quite dynamic, so it's good to be on the lookout for new or newly revised software that can facilitate your scripting work.
If you want to practice coding for data science and A.I. projects, there are a few books I’ve worked on that I’d recommend. Namely, the two Julia books I've written, as well as the A.I. for Data Science book I've co-authored, are great resources for data science and A.I. related coding. Check them out when you have a moment. Cheers!
As you may already know, Julia is a programming language geared towards scientific computing. It is particularly useful in data science nowadays, as there are many specialized libraries for this kind of work. At the same time, it's a fast and easy-to-work-with language, enabling you to create useful scripts for data science tasks quickly. Additionally, it's similar to Python, and there are bridge packages between the two languages, making it possible to jump from one to the other and leverage code from both in your data science projects.
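For instance, the PyCall.jl bridge package lets you call Python libraries directly from Julia. Here is a minimal sketch, assuming PyCall and the referenced Python library are installed:

```julia
# Calling Python from Julia via PyCall.jl: import a Python module and use it as if it were Julia
using PyCall

np = pyimport("numpy")           # any installed Python library can be imported this way
np.mean([1.0, 2.0, 3.0, 4.0])    # returns 2.5, computed by NumPy under the hood
```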
As for machine learning, this is the part of data science that deals with the creation, refinement, and deployment of specialized data models, based on the data-driven approach to data analytics. It involves systems like K-means (for clustering), Support Vector Machines (for predictive analytics), and various heuristics (for specific tasks such as feature evaluation) to facilitate all kinds of data science work. Much like Statistics, it is versatile, though unlike Stats, it doesn't rely on probabilistic reasoning and distributions for analyzing the data. That's not to say that you need to pick between the two frameworks, however. A good data scientist uses both in her work.
Julia and machine learning are a match made in heaven. Not only does Julia offer direct support for machine learning tasks (e.g., through its various packages), but it also makes it easy for a data scientist with just basic training in the language to write high-performance scripts for processing the data at hand. You can even use Julia just for your data engineering tasks if you are already vested in another programming language for your data models. So, it's not an "either-or" kind of choice but more of an add-on situation. Julia can be the add-on, though once you get familiar with it, you may want to translate your whole codebase into this language for the extra performance it offers.
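As a taste of that direct support, here is a minimal sketch of clustering some toy data with K-means, assuming the Clustering.jl package is installed:

```julia
# K-means clustering in Julia via the Clustering.jl package
using Clustering

X = rand(5, 200)            # toy dataset: 5 features × 200 observations (columns are points)
result = kmeans(X, 3)       # partition the points into 3 clusters
result.assignments[1:10]    # cluster label of each of the first ten points
result.centers              # 5×3 matrix with one centroid per column
```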
This prediction isn't some optimistic speculation, by the way. Julia has been evolving for the past few years at a growing rate, even as other programming languages have been emerging. Furthermore, it has the backing of a prestigious university (MIT), while there is a worldwide community of users and Julia-specific events, such as JuliaCon, happening regularly. So, if this trend continues, Julia is bound to remain relevant for years to come, expanding in functionality and application areas. Naturally, if machine learning continues on its current trajectory, it will also stick around for the foreseeable future.
If you want to learn more about Julia and machine learning, especially from a practical perspective, please check out my book Julia for Machine Learning, published in the Spring of 2020. There, you can learn more about the language, explore how it's useful in machine learning, see what machine learning entails and how it ties into the data science pipeline, and experiment with various lesser-known heuristics (some of them entirely original and accompanied by the corresponding Julia code). So, check this book out when you have the chance. Cheers!
In a nutshell, web scraping is the process of taking stuff from the web programmatically and often at scale. This involves specialized libraries as well as some understanding of how websites are structured, to separate the useful data from any markup and other stuff found on a web page.
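To make this a bit more tangible, here is a hedged sketch of the basic mechanics in Julia, assuming the HTTP.jl and Gumbo.jl packages and a placeholder URL:

```julia
# Fetch a page and pull the link targets out of the HTML markup
using HTTP, Gumbo

url  = "https://example.com"            # placeholder page for illustration
resp = HTTP.get(url)
doc  = parsehtml(String(resp.body))     # parse the raw HTML into a DOM tree

# Recursively collect the href attribute of every <a> element
function collect_links(node, acc = String[])
    if node isa HTMLElement
        tag(node) == :a && haskey(node.attributes, "href") &&
            push!(acc, node.attributes["href"])
        foreach(child -> collect_links(child, acc), node.children)
    end
    return acc
end

links = collect_links(doc.root)
```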
Web scraping is very useful, especially if you are looking to update your dataset based on a particular online source that's publicly available but doesn't offer an API. The latter is something very important and popular these days, but it's beyond the scope of this article; feel free to check out another article I've written on that topic. In any case, web scraping is very popular too, as it greatly facilitates data acquisition, be it for building a dataset from scratch or for supplementing existing datasets.
Despite its immense usefulness, web scraping is not without its limitations. These generally fall under the umbrella of ethics since they aren’t set in stone nor are they linked to legal issues. Nevertheless, they are quite serious and could potentially jeopardize a data science project if left unaddressed. So, even though these ethical rules/guidelines vary from case to case, here is a summary of the most important ones that are relevant for a typical data science project:
It’s also important to keep certain other things in mind when it comes to web scraping. Namely, since web scraping scripts tend to hang (usually due to the server they are requesting data from), it's good to save the scraped data periodically. Also, make sure you scrape all the fields (variables) you need, even if you are not sure about them. It's better to err on the plus side since you can always remove unnecessary data afterward. If you need additional fields though, or your script hangs before you save the data, you'll have to redo the web scraping from the beginning.
If you want to learn more about ethics and other non-technical aspects of data science work, I invite you to check out a book I co-authored earlier this year. Namely, the Data Scientist Bedside Manner book covers a variety of such topics, including some hands-on advice to boost your career in this field, all while maintaining an ethical mindset. Cheers!
Cloud computing has taken the world by storm lately, as it has effectively democratized computing power and storage. This has inevitably impacted data science work too, especially when it comes to the deployment stage of the pipeline. Two of the most popular methods for accomplishing this are containers and micro-services, both of which enable your programs to run on some server somewhere with minimal overhead.
The value of this technology is immense. Apart from the cost savings that come from the reduced overhead, it makes the whole process easier and faster. Getting acquainted with a container program like Docker isn't more challenging than learning any other software a data scientist uses, and there are lots of Docker images available for the vast majority of applications (including a variety of open-source OSes). Cloud computing in general is quite accessible, especially if you consider the options companies like Amazon offer.
The key value-add of all this is that a data scientist can now deploy a model or system as an application or an API on the cloud, where it can live as long as necessary, without burdening your coworkers or the company’s servers. Also, through this deployment method, you can scale up your program as needed, without having to worry about computational bandwidth and other such limitations.
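As a toy illustration of what such a deployable piece can look like at the code level (using HTTP.jl and a stand-in "model"; in practice you'd wrap this, along with its dependencies, in a container image):

```julia
# Minimal model-serving sketch: a web endpoint that returns a prediction for a number
# posted in the request body; this is the kind of small service one would containerize
using HTTP

predict(x) = 2.5x + 1.0   # stand-in for an actual trained model

HTTP.serve("0.0.0.0", 8080) do req
    x = parse(Float64, String(req.body))      # read a single number from the request body
    HTTP.Response(200, string(predict(x)))    # reply with the model's prediction
end
```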
Cloud computing can also be quite useful when it comes to storage. Many databases nowadays live on the cloud, since it's much easier to store and maintain data there, and most cloud storage providers offer quite a decent level of cybersecurity. Also, having the data live on the cloud makes it easier to share with the aforementioned data products deployed, for example, as Docker images. In any case, such solutions are more appealing for companies today, since not many of them can afford their own data center or other in-house solutions.
Of course, all this is the situation today. How are things going to fare in the years to come? Given that data science projects may span a long time (particularly if they are successful), it makes sense to think this through before investing in it. Considering that more and more people are working remotely these days (either from home due to COVID-19, or from a remote location as a lifestyle choice), it makes sense that cloud computing is bound to remain popular. Also, as more cloud-based solutions become available (e.g., Kubernetes), this trend is bound to continue and even expand in the foreseeable future.
Hopefully, it has become clear from all this that there are several angles to a data science project, beyond data wrangling and modeling. Unfortunately, not many people in our field try to explain this aspect of our work in a manner that's comprehensible and thorough enough. Fortunately, a fellow data scientist and I have attempted to cover this gap in one of our books, Data Scientist Bedside Manner. In it, we talk about all these topics and outline how a data science project can come to fruition. Feel free to check it out. Cheers!
Structured Query Language, or SQL for short, is a powerful database language geared toward structured data. As its name suggests, SQL is adept at querying databases to acquire the data you need, in a useful format. It also includes commands for creating or altering databases so that they can fit the requirements of your data architectural model. SQL is essential for data scientists and other data professionals alike. Let's look into it more, along with its various variants.
Although the data wrangling tasks SQL performs can also be carried out with other programming languages, its efficiency and relative ease of use make it a great tool. Perhaps that's why it's so popular, with many variants of it around. The most well-known one, MySQL, specializes in web databases, though it can also be used for other applications. Other variants, such as PostgreSQL, are mostly geared towards industry use. All this may seem somewhat overwhelming, considering that each variant has its peculiarities. However, all of these SQL variations are similar in their structure, and it doesn't take long to get accustomed to a new one if you already know another SQL variant.
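To illustrate, here is a hedged sketch of running a query from within Julia, assuming the SQLite.jl, DBInterface.jl, and DataFrames.jl packages and a hypothetical sales.db database with an orders table:

```julia
# Query a local SQLite database and bring the results into a DataFrame
using SQLite, DBInterface, DataFrames

db = SQLite.DB("sales.db")     # hypothetical database file
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    WHERE order_date >= '2021-01-01'
    GROUP BY region
    ORDER BY total_sales DESC;
"""
df = DBInterface.execute(db, query) |> DataFrame
```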
Yet, all SQL databases tend to be limited to data with a fixed, tabular structure. In other words, if your data is semi-structured (i.e., there are elements of structure in it, but it's not tabular), you need a different kind of database, namely a NoSQL (i.e., Not Only SQL) one. Databases like MongoDB and Cassandra are in this category. Note that NoSQL databases have many commands in common with SQL, but they are geared toward different ways of organizing the data (e.g., document- or column-based), which enables them to be faster and to handle dictionary-like structures.
Naturally, there are plenty more SQL variants, as well as databases under the NoSQL paradigm. Beyond these, there are also databases geared towards graphs, such as Neo4j. These specialized systems are built for storing and querying data in graph format, which is increasingly common nowadays. All these database-related matters fall under the umbrella of data modeling, a field geared towards organizing data flows, optimizing the ways data is stored, and ensuring that all the people involved (the consumers of this data) are on the same page. Although this is not strictly part of the data science field, it's important, and knowing about it enables better communication between the data architects (aka data modelers) and us.
You can learn more about SQL and other data modeling topics through the Technics Publications website. There, you can use the coupon code DSML for a 20% discount on all the titles you purchase. Note that this code may not apply to all the video material you'll find there, such as the courses offered. However, you can use it to get a discounted price for all the books. I hope you find this useful. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.