Data analytics and data science are both about finding useful (preferably actionable) insights in the data and helping with the decisions involved. In data science, this typically involves predictive models, usually in the form of data products, while the analysis is more in-depth and entails more sophisticated methods. But what tools do data analysts and data scientists use?
First of all, we need to examine the essential tasks that these data professionals undertake. For starters, they usually gather data from various sources and organize it into a dataset. This is usually followed by cleaning the data up to some extent. Data cleaning often involves handling missing values and ensuring that the data has some structure to it afterward. Another task involves exploring the dataset and figuring out which variables are the most useful. Of course, this depends on the objective, which is why data exploration often involves some creativity. The most important task, however, is building a model and doing something useful with the data. Finally, creating informative plots is another task that is useful throughout the data analyst/scientist's workflow.
You can do all the tasks mentioned earlier using a variety of tools. More specifically, for data acquisition tasks, SQL-based software is used (e.g., PostgreSQL). This sort of software uses the SQL language to access and query structured databases so that the most relevant data is gathered. For data science work, NoSQL databases are also often used, along with their corresponding software. As for the data analytics tasks themselves, a program like MS Excel is the staple of most data analysts, while data scientists tend to rely on a programming language like Python or Julia. Data analysts use Tableau or some similar application for data visualization tasks, while data scientists employ a graphics library or two in the programming language they use. For all other tasks (e.g., putting things together), both kinds of data professionals use a programming language like Python/Julia.
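To make the data acquisition step concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its contents are hypothetical, but the querying pattern is the same one you would use against a server-based system like PostgreSQL.

```python
import sqlite3

# Create an in-memory database with a hypothetical "sales" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 80.0), ("North", 200.0)])

# A typical acquisition query: aggregate just the most relevant data
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 320.0), ('South', 80.0)]
```

The same SELECT/GROUP BY pattern scales from a toy table like this one to a production data warehouse.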
Naturally, all of the tools mentioned previously have evolved and continue to evolve. The fundamental functionality may remain about the same, but new features and changes to the existing features are commonplace. For example, new libraries are coming about in programming languages, expanding the usefulness of the languages while creating more auxiliary tools for various data-related tasks. No matter how these tools evolve, the one thing that doesn't change is the mindset behind all the data analytics/science work. This mindset is the driving force of all such work and the ability to use whatever tools you have at your disposal to make something useful for the data at hand. The mindset is closely related to a solid understanding of the methodologies involved in data analytics/science.
For more information on this subject, particularly the mindset part, check out my book “Data Science Mindset, Methodologies, and Misconceptions.” In it, I cover this topic in sufficient depth, and even though it is geared towards data science professionals, it can be useful to data analysts and other data professionals as well. Cheers!
Ensembles are sophisticated machine learning models that comprise other, simpler models. They are quite useful when accuracy is the fundamental requirement and computational resources are not a severe limitation. Ensembles are popular in all sorts of current projects, as they have significant advantages over conventional models. Let's take a closer look at this through this not-too-technical article.
Ensembles are an essential part of modern data science, as they are more robust and powerful models. Additionally, ensembles are ideal for complex problems (something increasingly common in data science), as they can provide better generalization and more stability. In cases where conventional models fail to provide decent results, ensembles tend to work well enough to justify using data science for the problem at hand. That's why data scientists usually turn to them after all attempts to solve the problem with conventional models have failed.
The most common kind of ensemble is the one based on the decision tree model. The main reason for this is that this kind of model is relatively fast to build and train, while yielding reasonably good performance. What's more, it's easy to interpret, something that carries over to the ensemble itself, making it a relatively transparent model. Other ensembles are based on the combination of different models belonging to different families. This heterogeneous architecture enables the ensemble to process the data more diversely and yield better performance. For any ensemble, there needs to be a heuristic in place to determine how the outputs of its constituent models are fused. The most straightforward such heuristic is majority voting, which works quite well when all the ensemble's models are similar.
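The majority voting heuristic mentioned above can be sketched in a few lines of Python; the base model outputs below are hypothetical stand-ins for the predictions of an ensemble's constituent classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse the outputs of an ensemble's base models via majority voting."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of five base classifiers for a single data point
base_model_outputs = ["spam", "spam", "ham", "spam", "ham"]
print(majority_vote(base_model_outputs))  # spam
```

More elaborate fusion schemes (e.g., weighted voting or averaging predicted probabilities) follow the same idea, just with a smarter aggregation step.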
The main drawback of ensembles is a particular compromise you have to make when using them. Namely, transparency greatly diminishes, and in some cases, it disappears altogether. This can be an issue when you need the model you built to be somewhat interpretable. Additionally, ensembles can overfit if the dataset isn't large enough or if they comprise a large number of models. That's why they require special care and are better off being handled by experienced data scientists. Finally, ensembles require more computational resources than most data models. As a result, you need a good reason for using them in a large-scale scenario, since they can be quite costly.
You can learn more about ensembles and other machine learning topics in one of my recent books. Namely, in the Julia for Machine Learning book, published last year, I extensively cover ensembles and several other related topics. Using Julia as the language through which the various machine learning models and methods are implemented, I examine how all this can be useful in handling a data science project robustly. Check it out when you have a moment. Cheers!
Questions are an integral part of scientific work, especially where research is concerned. Without questions, we can't have experiments or the capability of testing our observations against our current understanding of the world. Naturally, this applies to all aspects of science, including one of its most modern expressions: data science. This article will explore the value of questions in science, how they relate to hypotheses, and how data science comes into the picture.
Questions are what we ask to figure out how things fit in the puzzle of scientific work. They can be general or specific (the latter being more common), and they are always linked to hypotheses in one way or another. The latter are the formal expressions of questions and usually take a true or false label, based on the tests involved. Tests are related to experimentation, whereby we see how the evidence (data) accumulated fits the assumptions we make. Without questions and hypotheses, we cannot have scientific theories and anything else useful in this paradigm of thought. Note that all this is possible because we are open to being wrong and are genuinely curious to find out what's going on in the area we are investigating.
Questions and hypotheses are also vital because they are a crucial part of the scientific method, the cornerstone of science. Since its inception a few centuries ago, the scientific method has laid the framework of scientific work, particularly in forming new theories and developing a solid understanding of the world. It's closely tied to experimentation since, at its core, science is empirical and relies on observable data rather than predefined ideas. The latter belong to the realm of philosophy, which, although not as prominent as it used to be, still has a role in science. The scientific method is precise and relies on a disciplined methodology, but it's also open to creativity. After all, not all questions are deterministic and predictable; some of them may stem from a deeper understanding of the world, guided by intuition.
But how does all this relate to data science? First of all, data science is the application of scientific methodologies to problems beyond scientific research. In this way, it is a broader application of scientific principles and the scientific method, involving data from all domains. Because of this, it's crucial to think about questions and try to answer them as systematically as we can, using the various data analysis methodologies at our disposal. Not all of these methodologies are as clear-cut and well-established as Statistics. Still, all of them involve models that try to describe or predict what would happen when new data comes into play, something fundamental in scientific work.
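As a minimal illustration of answering a question systematically with data, here is a sketch of a permutation test in plain Python: it asks whether two groups of (hypothetical) measurements differ in their means, without assuming any particular distribution.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10000, seed=42):
    """Estimate a p-value for the question: do groups a and b differ in mean?"""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    rng = random.Random(seed)  # seeded for reproducibility
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # break any real group structure
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_iter

# Hypothetical measurements from two experimental conditions
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treatment = [5.9, 6.1, 6.0, 5.8, 6.2]
p = permutation_test(control, treatment)
print(p < 0.05)  # True: evidence against the "no difference" hypothesis
```

The question ("do these groups differ?") becomes a hypothesis with a true/false label once we fix a significance threshold, mirroring the scientific workflow described above.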
If you wish to learn more about this topic and data science's application of the scientific method, check out my latest book Doing Science Using Data Science. In this book, my co-author and I explore various science-related topics, emphasizing the practical application of data science methodologies, describing how these ideas are implemented in practice. The book is accompanied by a series of Jupyter notebooks, using Julia and Python. Check it out when you have a moment. Cheers!
A data product is the main deliverable of data science and some data analytics projects. It involves developing a stand-alone piece of software, often with a data model under the hood. Other times, it takes the form of a set of visualizations that depict particular variables of interest or other useful insights. In any case, data products are vital as they constitute an essential part of a data science project and a useful deliverable in a data analytics project (even if it's not always a requirement).
Dashboards are a kind of data product, featuring graphics and an intuitive (albeit minimalist) interface. They sometimes involve some control element that enables the user to change some settings and adjust the related graphics to different operating conditions. This element provides a more dynamic aspect to the dashboard, which augments the innate dynamism they have. The latter stems from the fact that they are usually linked to a dataset that changes over time, as new data becomes available.
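As a rough sketch of that innate dynamism, the Python snippet below recomputes a dashboard's summary figures whenever new data arrives; the readings and the metrics chosen are hypothetical.

```python
def dashboard_summary(measurements):
    """Recompute the key figures a dashboard displays from the latest data."""
    return {
        "count": len(measurements),
        "latest": measurements[-1],
        "average": round(sum(measurements) / len(measurements), 2),
    }

# As new data points arrive, the linked summary refreshes accordingly
readings = [10.0, 12.0, 11.0]
print(dashboard_summary(readings))  # {'count': 3, 'latest': 11.0, 'average': 11.0}
readings.append(15.0)
print(dashboard_summary(readings))  # {'count': 4, 'latest': 15.0, 'average': 12.0}
```

A real dashboard adds the graphics and the control elements on top, but the core loop is the same: the visuals are functions of a dataset that changes over time.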
The popularity of dashboards illustrates data visualization's value, be it in data science or data analytics. It's hard to imagine a project like this without some visuals, pinpointing important insights and other findings. Additionally, whenever predictive models are involved, specialized visuals for showcasing the models' performance are a must. That's why data visualization as a sub-field of data science and data analytics has grown, especially in the past few years. The development of professional software undertaking such tasks and specialized libraries in various programming languages have contributed to this growth.
Beyond data visualization, however, other subtle aspects of the data science and data analytics fields are essential but less pronounced in the various educational materials out there. For example, the communication of insights and the use of the visuals mentioned earlier in presentations is something every data professional ought to know. This point is particularly important when you need to liaise with non-technical people, whether colleagues or clients. Also, managing a data analytics project can be challenging, especially in the modern Agile-driven workplace. After all, most data analytics projects today are all about teamwork, tight deadlines, and changing requirements. What's more, although a dashboard is a powerful asset in an organization, it needs to be maintained periodically and fed good-quality data. The latter requires additional work and proper data governance, something that not everyone involved in this field is aware of, unfortunately.
My Data Scientist Bedside Manner book, which I co-authored last year, is an excellent resource for this kind of topic. Although written mainly for data science professionals, it can be useful to all sorts of data analysts and people involved in data-driven projects (e.g., managers). The idea is to bridge the gap between technical and non-technical professionals in an organization and leverage data analytics work effectively. It is an excellent reference book that every data professional can benefit from in the years to come. Cheers!
Programming, particularly in languages like Julia, Python, and Scala, is fundamental in data science. It enables all kinds of processes, such as data engineering (particularly ETL tasks), data modeling, and even data visualization, to name a few. If you know what you are doing, you can also solve practical problems through programming (e.g., optimization tasks) by modeling them appropriately. It's a versatile tool with a lot of potential, especially once you get used to it and see it as an extension of your mind. However, it's not as simple as putting bits and pieces of programming code together. This article will examine the various strategies for coding concerning the various objectives we may have.
Let's start with the most intuitive kind of objective, namely getting things done as quickly as possible. This strategy may be suitable for problems that need to be solved only once, so the code doesn't need to be revised or reused. It involves writing code that works, to prove a particular concept before tackling the task more seriously. Efficiency isn't pursued, nor are readability and the use of comments to explain what's happening. This strategy is typical for solving a drill or a relatively simple problem that you don't need to present to the project stakeholders. As a result, using it in other scenarios is a terrible idea.
Another objective we may have when writing a script is efficiency. When we process lots of data, we don't want sluggish code that takes a while to finish the task at hand. So, optimizing the code for efficiency through smart memory allocations, static typing, the use of appropriate variable types, etc., can help with that. This programming strategy is quite common and can save us a lot of time. It's also valuable when deploying code at scale, since it means we'll be using fewer computational resources (CPU/GPU power and memory), lowering the cost of the project at hand.
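As a small Python illustration of this strategy, the two functions below compute the same result, but the second avoids allocating an intermediate list, which matters as the data grows; the function names are mine, for illustration only.

```python
def sum_squares_naive(n):
    # Builds a full intermediate list in memory before summing
    squares = []
    for i in range(n):
        squares.append(i * i)
    return sum(squares)

def sum_squares_efficient(n):
    # A generator expression: values are consumed one at a time,
    # so no intermediate list is ever allocated
    return sum(i * i for i in range(n))

assert sum_squares_naive(1000) == sum_squares_efficient(1000)
print(sum_squares_efficient(1000))  # 332833500
```

In a language like Julia, the analogous moves would be type-stable functions and preallocated arrays; the principle of sparing memory allocations is the same.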
Interpretability and maintainability are a different objective altogether, tied to the final programming strategy. If you want your program to be easy to read and understand, making it easy to update when necessary, you opt for this strategy. It involves organizing your code to break the problem down into simple tasks handled by different classes and functions, including lots of comments explaining your reasoning and what the different functions do. Naming the variables in an intuitive way is also a big plus, even if that makes the code longer at times. In any case, such code is built to last, since it's easy to maintain and helps newcomers who read it adopt good practices when writing their own code.
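A brief Python sketch of this strategy: small, well-named functions with docstrings, at the cost of a few extra lines. The business scenario (monthly revenue growth) is purely hypothetical.

```python
def monthly_growth_rate(current_value, previous_value):
    """Return the month-over-month growth rate as a fraction."""
    if previous_value == 0:
        raise ValueError("previous_value must be non-zero")
    return (current_value - previous_value) / previous_value

def format_as_percentage(fraction, decimals=1):
    """Render a fraction like 0.125 as a human-readable percentage string."""
    return f"{fraction * 100:.{decimals}f}%"

# Intuitive names make the intent obvious at the call site
revenue_march, revenue_april = 120_000, 135_000
growth = monthly_growth_rate(revenue_april, revenue_march)
print(format_as_percentage(growth))  # 12.5%
```

Someone reading this for the first time can follow the logic without any external explanation, which is precisely the point of this strategy.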
Naturally, you can use a combination of the above strategies for your project. Not all of them play ball together, of course, but you can still make a script that's efficient and easy to understand/maintain. So, unless pressed with time, it's good to have such an approach to your programming, adapting it to each project's requirements.
If you wish to learn more about programming and how it applies to data science, you can check out one of my latest books, Julia for Machine Learning. This book explores how the Julia programming language can be used to tackle various data science problems, using machine learning models and heuristics. Accompanied by a series of examples in Jupyter notebooks and script files, it illustrates in quite comprehensible code how you can implement this framework for your data science work. So, check it out when you have a moment. Cheers!
Machine Learning is the field involved in using various algorithms that enable a machine (typically a computer) to learn from the data available, without making any assumptions about it. It includes multiple models, some simpler, others more advanced, that go beyond the statistical analysis of the data. Most of these models are black boxes, though a few exhibit some interpretability. Yet, despite how well-defined this field is, several misconceptions about it conceal it in a veil of mystique.
First of all, machine learning is not the same as artificial intelligence (A.I.). There is an overlap, no doubt, but they are distinct fields. You can spend your whole life working in machine learning without ever using A.I. and vice versa. The overlap between the two takes the form of deep learning, the use of sophisticated artificial neural networks that are leveraged for machine learning tasks. Computer Vision is an area of application related to the overlap between machine learning and A.I.
What's more, machine learning is not an extension of Statistics. Contrary to what many Stats fans say, machine learning is a field entirely distinct from Statistics. There are similarities, of course, but there are also fundamental differences. One of the key ones is that machine learning is data-driven, i.e., it doesn't use any mathematical model to describe the data at hand, while Statistics does just that. It's hard to imagine statistical models without a data distribution or some function describing the mapping involved, while machine learning models can be heuristics-based instead.
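To illustrate what a heuristics-based, data-driven model looks like, here is a minimal nearest-neighbor classifier in Python: it makes no assumption about how the data is distributed; it simply copies the label of the closest training point. The toy dataset is, of course, hypothetical.

```python
from math import dist

def nearest_neighbor_predict(train_points, train_labels, query):
    """Classify `query` by copying the label of its closest training point."""
    distances = [dist(point, query) for point in train_points]
    return train_labels[distances.index(min(distances))]

# Toy 2-D dataset: two loose clusters, no distribution specified anywhere
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.3, 4.7)]
labels = ["A", "A", "B", "B"]
print(nearest_neighbor_predict(points, labels, (1.1, 1.1)))  # A
print(nearest_neighbor_predict(points, labels, (4.9, 5.2)))  # B
```

Contrast this with a statistical approach, which would first posit a distribution for each class and then derive the decision rule from it.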
Nevertheless, machine learning is not purely heuristics-based and, therefore, void of theoretical foundations. Even if it lacks the two centuries of accumulated theory that Statistics enjoys, machine learning has some theoretical standing, based on the few decades of research behind it. Many of its methods rely on heuristics that "just work," but it's not what people would consider alchemy. Machine learning is a respectable scientific field with lots to offer both to the practitioner and the researcher.
Beyond the misconceptions mentioned earlier, there are additional ones that are worth considering. For example, machine learning is not plug-and-play, as some people think, no matter how intuitive the corresponding libraries are. What's more, machine learning is not always the best option for the problem at hand, since some projects are okay with something simple that's easy to understand and interpret. In cases like that, a statistical model would do just fine.
It's hard to do this topic justice in a single blog post, but hopefully, this has given you an idea of what machine learning is and what it isn't. I talk more about this subject in one of my most recent books, Julia for Machine Learning. Additionally, I plan to cover this topic in some depth in a 90-minute talk at the next Data Modeling Zone conference in Belgium this April. I hope to see you there! Cheers.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.