Delimited File Formats and Their Usefulness in Data Science

11/5/2020

Delimited files are plain-text files for storing structured (tabular) data. They are uncompressed and easy to parse with a wide variety of programs, and with programming languages in particular. Delimited files are widely used in data analytics projects, including those related to data science. But what are they about, and why are they so popular?

There are various types of delimited files, all of which have their use cases. The most common ones are comma-separated values (CSV) files and tab-separated values (TSV) files. However, any character can be used to separate the values, such as the pipe (|) or the semicolon (;). All delimited files are similar, though, in the sense that they contain raw data organized through the use of a delimiter (the separator character just mentioned), with each line of the file typically holding one record. Note that in many cases the variable names are included in the file, usually in the top row. Like the rest of the file, that row has its values (in this case, the variable names) separated by the delimiter.
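To make the role of the delimiter concrete, here is a minimal sketch in Python using only the standard library. The helper name read_delimited and the sample data are made up for illustration; the point is that the same table parses identically whether it is separated by commas, tabs, or pipes, with the top row supplying the variable names.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse delimited text whose first row holds the variable names.

    Returns a list of dictionaries, one per data row, keyed by variable name.
    (Hypothetical helper, written for this example only.)
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

# The same small table, written with three different delimiters.
csv_text = "name,age\nAda,36\nAlan,41\n"
tsv_text = "name\tage\nAda\t36\nAlan\t41\n"
pipe_text = "name|age\nAda|36\nAlan|41\n"

rows_csv = read_delimited(csv_text)
rows_tsv = read_delimited(tsv_text, delimiter="\t")
rows_pipe = read_delimited(pipe_text, delimiter="|")
```

Note that every value comes back as text (for example, "36" rather than 36), since a delimited file carries no type information of its own.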

Delimited files are very useful in data science work, and in data analytics work in general. They are straightforward to produce (e.g., most software has an "export as CSV" option) and easy to access since they are, in essence, just text files. Also, practically every programming language used in data science has a library for loading and saving data in this format, which is one reason so many datasets are distributed this way. Additionally, if a delimited file is corrupt, it's relatively manageable to pinpoint the problem and correct it. And even if the problem is infeasible to remedy, you can still access the healthy part of the file and retrieve the data there. With specialized data files involving compression, this isn't possible most of the time.
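As a rough illustration of retrieving the healthy part of a damaged file, the sketch below keeps every row whose field count matches the header and sets the rest aside along with their line numbers. The function name salvage_rows and the sample data are hypothetical, and a field-count check is only one simple way to spot a damaged row; real corruption can take other forms.

```python
import csv
import io

def salvage_rows(text, delimiter=","):
    """Separate well-formed rows from damaged ones.

    A row counts as well-formed here if it has as many fields as the header.
    Damaged rows are returned with their line numbers so they can be inspected.
    (Hypothetical helper, written for this example only.)
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        if len(row) == len(header):
            good.append(row)
        else:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, row))
    return header, good, bad

# The third line of this sample is missing its last field.
damaged = "id,city,temp\n1,Oslo,4.5\n2,Lima\n3,Cairo,30.1\n"
header, good, bad = salvage_rows(damaged)
```

Because the file is plain text, the line numbers collected in bad point straight at the spots to open in a text editor and fix by hand.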

In Julia, there is a CSV library called CSV.jl (very imaginative name, I know!). Although its functionality is relatively basic, it is very fast (much faster than the Python equivalent) and it has excellent documentation. Despite what the name suggests, this library can be used for any delimited file, not just CSVs. The "delim" keyword argument is responsible for this, though you don't always need to set it, since the reading function can figure out the delimiter character on its own most of the time.
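The idea of letting the reader work out the delimiter can be sketched in Python too, using the standard library's csv.Sniffer. This is an analogue for illustration, not a description of how CSV.jl detects delimiters internally; the function name, the candidate list, and the fallback value are all assumptions made for this example.

```python
import csv
import io

def load_with_detected_delimiter(text, candidates=",;\t|", fallback=","):
    """Guess the delimiter from the text, then parse the text with it.

    Detection works most of the time but not always, so an explicit
    fallback delimiter is used when the sniffer gives up.
    (Hypothetical helper, written for this example only.)
    """
    try:
        dialect = csv.Sniffer().sniff(text, delimiters=candidates)
        delimiter = dialect.delimiter
    except csv.Error:
        delimiter = fallback
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return delimiter, rows

sample = "name;age;city\nAda;36;London\nAlan;41;Wilmslow\n"
detected, rows = load_with_detected_delimiter(sample)
```

As with the "delim" option, passing the delimiter explicitly remains the safer choice whenever you already know what it is.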

Delimited files are instrumental in data science work, but they aren't always the best option. For example, when you are dealing with semi-structured or unstructured data, it's best to use a different format for your data files, such as JSON. Also, if you are working with NoSQL databases, delimited files may not be useful at all, since the data in those databases is better suited to a dictionary-like data structure, such as that provided by JSON and XML files.
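Here is a small sketch of why semi-structured data sits awkwardly in a delimited file. The two records below (made-up data) have different fields, and some of those fields are nested. JSON stores them as they are, whereas forcing them into a table means taking the union of the columns, leaving blanks, and squeezing nested values into single text cells.

```python
import csv
import io
import json

# Two records with different, partly nested fields (illustrative data).
records_json = (
    '[{"name": "Ada", "tags": ["math", "code"]},'
    ' {"name": "Alan", "address": {"city": "Wilmslow"}}]'
)
records = json.loads(records_json)

# A table needs one fixed set of columns: the union of all top-level keys.
columns = sorted({key for record in records for key in record})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
for record in records:
    # Nested lists and dictionaries are serialized to text to fit in one cell.
    flat = {
        key: json.dumps(value) if isinstance(value, (list, dict)) else value
        for key, value in record.items()
    }
    writer.writerow(flat)

flattened = buffer.getvalue()
```

The nested structure survives only as text inside a cell, so anyone reading the flattened file has to parse those cells a second time to get the original data back.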

If you want to learn more about this topic and other topics related to data science work, feel free to check out my book Data Science Mindset, Methodologies, and Misconceptions. In this book, I talk about all data science-related matters, including data structures and such, providing a good overview of all the material related to this fascinating field. So, check it out when you have the chance. Cheers!

FOXY DATA SCIENCE
unconventional insights about data science, A.I., cybersecurity, data analytics, and more