The usefulness of JSON lies in the fact that it's versatile and relatively concise. What's more, it's faster to parse than similar file formats, and it's already widely used in web-related applications, making it easy to find mature programming libraries for it. Moreover, JSON is very intuitive, and many text editors have built-in functionality for viewing such files in an easy-to-read way. Furthermore, it's easy to create and edit JSON files yourself using a text editor, while doing so programmatically is a walk in the park.
JSON’s compatibility with NoSQL databases is one of its fortes. Such systems include databases like MongoDB, which are quite popular in data science. Most new databases are also compatible with JSON, as it's become something of a standard. Additionally, JSON and the dictionary data structure go hand-in-hand, something vital in data science work. So, if you want to load some data from a JSON file, you can store it in a dictionary, while if you have a dataset (any dataset), you can encode it as a dictionary (each variable being a key) and store it as a JSON file.
The JSON.jl library in Julia is one worth knowing about, especially if you want to use this programming language in your data science work. This fairly simple package enables you to parse and create JSON files using the built-in Dict structure. It's a convenient library to know, even if it's still in version 0.21.x. JSON.jl makes use of the FileIO package on the back end, and its most useful functions are parse(), parsefile(), and print(). Note that the latter works with different data structures, not just dictionaries.
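To give you an idea of how this works in practice, here's a minimal sketch using these functions (the file name and data are just placeholders):

```julia
import JSON

# A small dataset encoded as a dictionary (each variable being a key)
data = Dict("age" => [23, 35, 41], "income" => [40000, 52000, 61000])

# Write the dictionary to a JSON file using print()
open("data.json", "w") do f
    JSON.print(f, data)
end

# Load it back with parsefile(); the result is a Dict again
loaded = JSON.parsefile("data.json")
println(loaded["age"])   # Any[23, 35, 41]

# parse() works directly on JSON strings too
record = JSON.parse("""{"name": "Alice", "score": 9.5}""")
```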
The JSON file format is closely linked to APIs too. The latter are particularly useful in various data-related applications and are instrumental in certain data products developed by data scientists. Also, many APIs are essential for acquiring data, so knowing how to work with them is a must. APIs are ideal for proof-of-concept projects, too, as they don't require too much work to get up and running. As a result, they are a versatile tool for all sorts of projects, particularly those with a web presence.
The API Success book describes this technology in sufficient depth, without getting too technical. Besides, if you understand APIs' usefulness and how they fit into the bigger picture, it's not too hard to learn the technical aspects too, through a tutorial, for example. Note that you can get a 20% discount on this and any other book available at the publisher's website using the coupon code DSML. Using this code will also help me out, so you can see it as a way to support this blog. Cheers!
Delimited files are specialized files for storing structured (tabular) data. They are uncompressed and easy to parse through a variety of programs, particularly programming languages. Delimited files are widely used in data analytics projects, such as those related to data science. But what are they about, and why are they so popular?
There are various types of delimited files, all of which have their use cases. The most common ones are comma-separated values (CSV) files and tab-separated values (TSV) files. However, any character can be used to separate the values of the variables, such as the pipe (|) or the semicolon (;). All delimited files are similar, though, in the sense that they contain raw data organized through the use of a delimiter (the character mentioned earlier). Note that in many cases the variable names are included in the delimited file, usually in the top row. Like the rest of the file, that row also has its values (in this case, the variable names) separated by the delimiter.
Delimited files are super useful in data science work and data analytics work in general. They are straightforward to produce (e.g., most software has an "export as CSV" option available) and easy to access since they are, in essence, just text files. Also, every data science programming language has a library for loading and saving data in this file format. The fact that many datasets are available in this format is a consequence of that. Additionally, if a delimited file is corrupt, it's relatively easy to pinpoint the problem and correct it. Even if the problem is infeasible to remedy, you can still access the healthy part of the file and retrieve the data there. In the case of specialized data files involving compression, this isn't possible most of the time.
In Julia, there is a CSV library called CSV.jl (very imaginative name, I know!). Although its functionality is relatively basic, it is super fast (much faster than the Python equivalent), and it has excellent documentation. Despite what the name suggests, this library can be used for all kinds of delimited files, not just CSVs. The "delim" parameter is responsible for this, though you don't always need to set it, since the corresponding function can figure out the delimiter character on its own most of the time.
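Here's a minimal sketch of saving and loading a pipe-delimited file, assuming a recent version of CSV.jl alongside DataFrames.jl (the file name is a placeholder):

```julia
using CSV, DataFrames

# A small table to save as a pipe-delimited file
df = DataFrame(id = 1:3, score = [7.5, 9.0, 8.2])
CSV.write("scores.psv", df; delim='|')

# Read it back; delim is usually inferred automatically, but it can be set explicitly
df2 = CSV.read("scores.psv", DataFrame; delim='|')
```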
Delimited files are instrumental in data science work, but they aren't always the best option. For example, when you are dealing with semi-structured or unstructured data, it's best to use a different format for your data files, such as JSON. Also, if you are working with NoSQL databases, delimited files may not be useful at all, since the data in those databases is best suited to a dictionary data structure, such as that provided by JSON and XML files.
If you want to learn more about this topic and other topics related to data science work, feel free to check out my book Data Science Mindset, Methodologies, and Misconceptions. In this book, I talk about all data science-related matters, including data structures and such, providing a good overview of all the material related to this fascinating field. So, check it out when you have the chance. Cheers!
Structured Query Language, or SQL for short, is a powerful database language geared toward structured data. As its name suggests, SQL is adept at querying databases to acquire the data you need, in a useful format. It also includes commands for creating and altering databases so that they can fit the requirements of your data architectural model. SQL is essential for both data scientists and other data professionals. Let's look into it more, along with its various variants.
Although the data wrangling tasks SQL performs can also be carried out with other programming languages, its efficiency and relative ease of use make it a great tool. Perhaps that's why it's so popular, with many variants of it around. The most well-known one, MySQL, specializes in web databases, though it can also be used for other applications. Other variants, such as PostgreSQL, are geared more towards industry applications. All this may seem somewhat overwhelming, considering that each variant has its peculiarities. However, all of these SQL variations are similar in their structure, and it doesn't take long to get accustomed to any one of them if you already know another SQL variant.
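Whichever variant you use, the core commands look much the same. As a rough sketch, here's how you might create, populate, and query a table from Julia, assuming the SQLite.jl, DBInterface.jl, and DataFrames.jl packages (the database file and table are hypothetical):

```julia
using SQLite, DBInterface, DataFrames

db = SQLite.DB("example.db")   # a hypothetical database file

# Create a table and add a couple of records (data definition and manipulation)
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS clients (id INTEGER, name TEXT, revenue REAL)")
DBInterface.execute(db, "INSERT INTO clients VALUES (1, 'Acme', 12500.0), (2, 'Globex', 8200.0)")

# Query the data you need, in a useful format (here, a DataFrame)
df = DataFrame(DBInterface.execute(db, "SELECT name, revenue FROM clients WHERE revenue > 10000"))
```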
Yet, all SQL databases tend to be limited when it comes to the structure of the data. In other words, if your data is semi-structured (i.e., there are elements of structure in it, but it's not tabular data), you need a different kind of database. Namely, you require a NoSQL (i.e., Not Only SQL) one. Databases like MongoDB, Cassandra, and CouchDB are in this category. Note that NoSQL databases have many commands in common with SQL, but they are geared toward a different organization of the data (e.g., document-based or column-based). This characteristic enables them to be faster and able to handle dictionary-like structures.
Naturally, there are plenty more database variants out there, primarily under the NoSQL paradigm. However, beyond these SQL-like databases, there are also those geared towards graphs, such as Neo4j. These specialized systems are designed for storing and querying data in graph format, which is increasingly common nowadays. All these database-related matters fall under the umbrella of data modeling, a field geared towards organizing data flows, optimizing the ways this data is stored, and ensuring that all the people involved (the consumers of this data) are on the same page. Although this is not strictly part of the data science field, it's worth knowing about, since it enables better communication between the data architects (aka data modelers) and us.
You can learn more about SQL and other data modeling topics through the Technics Publications website. There, you can use the coupon code DSML for a 20% discount on all the titles you purchase. Note that this code may not apply to all the video material you'll find there, such as the courses offered. However, you can use it to get a discounted price for all the books. I hope you find this useful. Cheers!
(the lady in the picture is a metaphor for the "feature" or "set of features" in the dataset at hand)
In data science, a feature is a variable that has been cleaned and processed so that it's ready to use in a model. Most data models are sensitive to the scale of their inputs, so most features are normalized before they are used in these models. Naturally, not all features add value to a model or a data science project in general. That's why we often need to evaluate them, in various ways, before we can proceed with them in our project.
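For instance, here's a minimal sketch of one common normalization scheme (z-score scaling) in Julia, applied to a made-up feature:

```julia
using Statistics

# Z-score normalization: rescale a feature to mean 0 and standard deviation 1
normalize(x) = (x .- mean(x)) ./ std(x)

income = [32_000.0, 48_000.0, 61_000.0, 75_000.0]   # a made-up feature
income_scaled = normalize(income)
```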
Evaluating features is often an essential part of the data engineering phase of the data science pipeline. It involves comparing them with the target variable in a meaningful way to assess how well they can predict it. This assessment can be done by evaluating the features either individually or in combination. Of these approaches, the first one is more scalable and more manageable to perform. However, since there are inevitable correlations among the features, the individual approach may not paint the right picture: two "good" features may not work well together because they capture the same information. That's why evaluating a group of features is often better, even if it's not always practical.
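As a toy illustration of the individual approach, here's a sketch in Julia that scores each (made-up) feature by its absolute correlation with the target:

```julia
using Statistics

# Score each feature individually by its absolute correlation with the target.
# This is the simple univariate approach; it ignores interactions among features.
X = [1.0 10.0;
     2.0  9.0;
     3.0  7.5;
     4.0  6.0]            # two made-up features (columns)
y = [1.2, 2.1, 2.9, 4.2]  # target variable

scores = [abs(cor(X[:, j], y)) for j in 1:size(X, 2)]
```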
Note that the usefulness of a feature usually depends on the problem at hand. So, we need to be clear as to what we are trying to predict. Also, even though a good feature is bound to be useful in all the data models involved, it's not utilized the same way in each. So, having an intimate understanding of the models can be immensely useful for figuring out what features to use. What's more, the form of the feature is also essential to its value. If a continuous feature is used as is, its information is utilized differently than if it is binarized, for example. Sometimes, the latter is a good idea, as we don't always care for all the information the feature may contain. However, it's best not to binarize continuous features haphazardly, since that may limit the data models' performance.
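To illustrate, here's a quick sketch of binarizing a made-up continuous feature around its median:

```julia
using Statistics

# Binarizing a continuous feature around its median: we keep only the
# "high vs. low" information and discard the rest, so use with care
age = [22.0, 31.0, 45.0, 58.0, 64.0]   # a made-up feature
age_binary = age .>= median(age)        # Bool values, usable as a 0/1 feature
```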
The methodology involved also plays a vital role in feature evaluation. For example, if you perform classification, you need to assess your features differently than if you are performing regression. Also, note that features play a different role altogether when performing clustering, as the target variable doesn't participate (or is missing altogether). As a result of all this, evaluating features is crucial for dimensionality reduction, a methodology that is closely linked to it and usually follows it.
You can learn more about features and their use in predictive models in my latest book, Julia for Machine Learning. This book explores the value of data from a machine learning perspective, with hands-on application of this know-how on various data science projects. Feature evaluation is one aspect of all this, which I describe through the use of specialized heuristics. Check out this book when you have a chance and learn more about this essential subject!
Dimensionality reduction is the methodology of encoding a dataset's information into a new dataset consisting of a smaller number of features. This approach tries to address the curse of dimensionality, which involves the difficulty of handling data that comprises too many variables. This article will explore a primary taxonomy of dimensionality reduction and how this methodology ties into other aspects of data science work, particularly machine learning.
There are several types of dimensionality reduction out there. You can split them into two general categories: methods involving the feature data in combination with the target variable, and methods involving the feature data only. Additionally, the methods of the second category can be split into those involving projection techniques (e.g., PCA, ICA, LDA, Factor Analysis, etc.) and those based on machine learning algorithms (e.g., Isomap, self-organizing maps, autoencoders, etc.). You can see a diagram of this classification below.
The most noteworthy dimensionality reduction methods used today are Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and autoencoders. However, in cases where the target variable is used, feature selection is a great alternative. Note that you can always combine different dimensionality reduction methods for even better results. This strategy works particularly well when the methods come from different families.
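To make this a bit more concrete, here's a minimal PCA sketch in Julia, assuming the MultivariateStats.jl package (which expects observations as columns):

```julia
using MultivariateStats

# Made-up dataset: 10 features (rows) by 200 observations (columns)
X = randn(10, 200)

# Fit a PCA model that keeps at most 3 principal components
M = fit(PCA, X; maxoutdim=3)

# Project the data onto the reduced 3-dimensional space (a 3 × 200 matrix)
Z = predict(M, X)   # in older versions of the package this was transform(M, X)
```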
It's important to remember that dimensionality reduction is not always required, no matter how powerful it is as a methodology. Sometimes the original features are good enough, or the project requires a transparent model, something not always feasible when dimensionality reduction is involved. What's more, dimensionality reduction always involves some loss of information, so sometimes it's not a good idea. It's crucial to gauge all the pros and cons of applying such an approach before doing so, since the compromises it entails may not always be worth it.
Many of the datasets found in data science projects today involve variables that are somehow related to each other, though this relationship is a non-linear one. That's why many traditional dimensionality reduction methods may not perform as well, especially when the datasets are complex. That's also why machine learning methods are more prevalent in cases like this and why there is a lot of research in this area. What's more, these dimensionality reduction methods integrate well with other machine learning techniques (e.g., in the case of autoencoders). This fact makes them a useful addition to a data science pipeline.
In my book Julia for Machine Learning, I dedicate a whole chapter to dimensionality reduction, focusing on these relatively advanced methods. Additionally, I cover other machine learning techniques, including several predictive models, heuristics, etc. So, if you want to learn more about this subject, check it out when you have the chance. Cheers!
The world of data professionals is sophisticated and diverse, especially nowadays. It involves professionals whose expertise ranges from the design of data flows to databases, data analytics models, machine learning systems, and the APIs that connect users to a cloud-based solution. It's not a simple matter, and the variety and depth of all these roles can leave people bewildered and uncertain about what this ecosystem is and what it can do for an organization.
We can attempt to gain an understanding of this world by reviewing the various professionals found in it. First of all, we have the data architects (aka data modelers), who are responsible for designing data/information flows, facilitating communication among the people in an organization, and developing the infrastructure for all the movement and storage of the organization's data. They are often involved in database solutions, as well as ETL processes and the creation of glossaries. Data architects are essential in an organization, especially when there is plenty of data involved or the data plays a vital role in the organization's workflow. Most modern organizations fit this description, and the abundance of data makes these professionals necessary.
Beyond this role, there are also data analytics professionals, particularly data scientists. This sort of professional is involved in deriving value from the available data, usually through discovering insights. Data scientists are more geared towards messy (e.g., unstructured or highly noisy) data and more advanced models. All data analytics professionals work with databases through focused querying, while the creation of visuals based on the data is an essential part of their pipeline. Naturally, this role involves some programming (more so in the case of data scientists) and communication with each project's stakeholders. The creation of dashboards is a typical deliverable in this role, though other kinds of data products are sometimes developed instead.
Data engineers are also an essential kind of professional in this ecosystem. This role entails data governance, particularly when big data is involved, as well as various ETL processes that facilitate data analytics work. Managing containers in the cloud and specialized software like Spark is part of these professionals' job description. Data engineers are heavy on programming and often deal with computer clusters, be they physical or virtual. Their communication with the project stakeholders is relatively limited, although they liaise with data scientists quite a bit. Some data engineers are well-versed in data science methods, particularly the development and deployment of predictive models.
Finally, business intelligence (BI) folks also have a role to play in the data world. This kind of role involves liaising with the managers and other project stakeholders. BI professionals tend to be more knowledgeable regarding the inner workings of an organization. At the same time, their use of data is limited to basic models, useful graphics, and descriptions of the problem at hand. BI professionals are closest to data analysts, though they tend to be more involved in high-level tasks. Also, their use of programming is minimal.
If you want to learn more about the world of data professionals, I invite you to check out some great books, like those available at the Technics Publications site. Although geared more towards data modeling, this publisher covers the subject quite well, providing practical knowledge from various professionals in the fields mentioned earlier. If you use the coupon code DSML, you can get a 20% discount on any books purchased. Check it out when you have the chance. Cheers!
As privacy matters gain more attention these days, transparency also gains a lot of value in data science work. This is to be expected for another reason, which I hope has become more obvious if you have been following this blog: transparent models are easier to explain to others. Beyond these advantages, there are others too (for instance, transparent models are easier to tweak and optimize), which I'm not going to elaborate on right now. Instead, I'm going to look at the various data models used in data science and where they fall on the transparency spectrum.
At one extreme of this spectrum lie the most transparent data models. These are usually statistics-based, since they can provide exact proportions of each feature's contribution. Also, you know exactly what's going on with the decisions involved in the predictions they yield. Even if you know nothing about data science, you can still make sense of these models and understand the predictions they yield. The main disadvantage of these models is that they are not as accurate, partly because of the overly simple processes they use.
At the other extreme of the spectrum, you can find the most opaque data models. These are usually AI-based and are often referred to as black boxes. Not only do they not tell us anything about feature importance, but trying to explain their inner workings is a futile task. However, they tend to have an edge in performance when it comes to accuracy, plus they require very little prep work for the data they use (data engineering).
Somewhere in the middle of the spectrum lie all the other models, mostly under the machine learning category. These include random forests and boosted trees (some transparency), k-nearest neighbors (very little transparency), support vector machines (no transparency), and fuzzy logic systems (pretty decent transparency). That’s a category of models most people forget, since they tend to think of transparency as a binary attribute.
Finally, it’s good to remember that transparency is usually linked to a business requirement. Also, sometimes the performance you obtain from the black box models is a good trade-off since some projects require high accuracy in the predictions involved. So, transparency is not always a necessity even if it can facilitate the communication of these models to the project stakeholders. As a result, it's always good to think about whether you need this extra transparency that a statistical model may offer you if you can achieve better performance with a less transparent model.
For more information about transparency and other aspects of data science models (particularly machine learning related), you can check out my latest book, Julia for Machine Learning. It is a very hands-on kind of book, which doesn't neglect to provide a lot of information needed to build the right mindset when it comes to data science work. Also, it includes lots of examples and links to useful resources that can help you understand all the concepts involved.
Although I covered this topic briefly about a year and a half ago, it seems that it's due for an update. After all, many people are still unaware of this terrific tool, while I always get positive feedback when I introduce it to mentees of mine. In a nutshell, Wakelet is a simple collection tool for organizing and sharing content over the internet. The collections (aka wakelets) can be private, public, or shareable with specific individuals via a link.
The Wakelet website does a great job of informing people about the merits of this tool, which is quite popular among educators. What it doesn't tell you is that it's great for data science practitioners too. Namely, a wakelet can be a great place to exhibit your portfolio of projects, as well as any other material that you’ve created that’s relevant to a data science career. You can also include any publications you may have, any videos you’ve created, and any programs you’d like to share with the data science world. The big advantage of wakelets is that you can add supplementary text to accompany your material, so the whole thing is more meaningful and accessible to your audience. The free graphics the program offers are also useful for making the collection more appealing to newcomers.
So far I’ve developed a few wakelets, mostly around the AI-related articles I’ve written and the books I’ve authored. Also, there are a few wakelets that I keep private, as well as another one I’ve shared with an associate of mine. What’s more, I plan to continue creating wakelets as I have more material to share (e.g., webinars, videos, etc.). The community aspect of Wakelet is something I’ve recently discovered and am in the process of exploring. In any case, it’s always interesting to view other people’s wakelets and get ideas about how to organize shareable content elegantly.
The collaboration aspect of Wakelet is something worth exploring too. It involves two or more people working on the same wakelet, either contributing or editing content. This can be done in the traditional way whereby the contributors access a wakelet independently, or they can collaborate through MS Teams and share content from there (e.g. conversations) through their wakelet. Wakelet collaboration is still fairly new as a feature but it's getting quite popular and it's something worth looking into, for sure.
Wakelet is quite popular among content creators, but it seems that its target audience is growing as it develops new features and a larger community of users. As a result, it may become the go-to option for sharing any content that's too large to fit in a single document. Also, as wakelets can be organized efficiently and elegantly on the Wakelet page, it makes sense to create several of these collections and perhaps even link them together where appropriate. In any case, the fact that all these collections are also accessible through the corresponding app makes it a versatile and practical tool. So, I invite you to check it out and let me know what you think about it. Cheers!
With all this talk these days about Statistics and other frameworks and their immense value in data science, it’s good to be more pragmatic about this matter. After all, it’s not a coincidence that Machine Learning maintains the top position both as a framework and as a specialization when it comes to data science work. In this article, we'll explore why this is.
First of all, machine learning is a more scientific paradigm for data science. It doesn't make many assumptions, as it relies on the data at hand and little else. Well, there are also the ML models that it makes use of, but it doesn't try to model everything as one distribution or another and rely on metrics based on these distributions. The scientific approach has proven itself to be very useful in understanding the world, so it only makes sense that it is used (in the form of machine learning methods) in data science too.
What’s more, machine learning makes use of more advanced methods than other frameworks. After all, it makes sense that if a framework works well, as in the case of machine learning, more methods are researched and refined. As a result, the models that machine learning brings to the table are more state-of-the-art and efficient. This makes using the machine learning framework a no-brainer, particularly when it comes to critical processes where accuracy and efficiency are key requirements.
Also, machine learning nowadays is powered to a great extent by AI, creating powerful models that outperform anything else available to a data scientist. This may be a trend that's here to stay, since many AI-based models have proven to be exceptionally good and versatile. Although these models have special requirements that may not be met in every data science setting, it's good that this option is available for data science work.
Moreover, machine learning is easier to learn and use, since it doesn't have a lot of theory behind it. As a result, you don't need to spend a lot of time learning it or worrying about the requirements of each model, as in Statistics. Of course, there is some theory in this framework too, but it's fairly straightforward and doesn't require overly specialized math to learn to an adequate degree.
Finally, there are lots of libraries nowadays for every machine learning model or process, making it easy to implement. In other words, you don't have to do a lot of coding to get your machine learning method up and running. Also, the fact that these libraries usually have adequate documentation makes it easier to understand the corresponding programs and techniques, supplementing your learning.
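As a rough illustration of this point, here's how little code a typical model takes in Julia, assuming the DecisionTree.jl package (the data is made up):

```julia
using DecisionTree

# Made-up data: 100 observations with 4 features each, and binary labels
features = rand(100, 4)
labels = rand(["yes", "no"], 100)

# Train a small random forest and make predictions on new data
model = RandomForestClassifier(n_trees=10)
fit!(model, features, labels)
preds = predict(model, rand(5, 4))
```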
Speaking of learning, if you wish to learn more about machine learning through a hands-on approach to the subject, feel free to check out my latest book, Julia for Machine Learning (Technics Publications). There I talk about the subject in some depth, while I explain how you can use Julia to deploy different kinds of machine learning models and heuristics. Cheers!
In a previous article we talked about the value of data modeling and how it is related to data science as a field. Now let’s look at some great ways to learn more about this field.
Specifically, Technics Publications offers a few classes/workshops on data modeling this Autumn:
What’s more, you can get a 20% discount on them, if you use the coupon code DSML. You can use the same code for most of the books available on that site. Check it out!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.