Lately (and I use this term loosely), there's been a lot of talk about deep learning. It's hard to find an article about data science that doesn't mention Deep Learning in one way or another. Yet, despite all its publicity, Deep Learning is still conflated with machine learning by most of the people consuming this sort of article. This conflation can lead to misunderstandings that can be costly in a business setting, as it creates a disconnect between the data science team and the project stakeholders. Let's look into this topic more closely and clarify it a bit.
Machine Learning is a relatively broad field that has become an instrumental part of data science. Complementary to Statistics, Machine Learning takes a data-driven approach to analyzing data, involving the use of heuristics and predictive models. Most models used by data scientists today tend to fall into this category. Random Forests and Boosted Trees, for instance, are commonplace and powerful, and they are classic examples of machine learning. But these aren't the only ones, and lately, they have started to give way to other, more powerful models. Those more powerful models are in deep learning territory.
Deep Learning is a part of AI that deals with machine learning problems. It remains an innate part of the AI field, but because of its applicability to Machine Learning, it is often considered part of the latter too. After all, AI has spread into various domains these days, and as predictive analytics is one domain where it can add lots of value, its presence there is considerable. In a nutshell, Deep Learning involves large artificial neural networks (ANNs) that are trained and deployed to tackle data science problems. There are several such networks, but they all share one key characteristic: they go deep into the data, developing thousands of features in an automated manner to capture its intricacies. This sophistication enables them to yield higher accuracy and harness even the weakest signals in the data they are given.
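To make this more concrete, here's a minimal sketch of training a small neural network classifier in Python, using scikit-learn's MLPClassifier on synthetic data (a shallow cousin of the deep networks discussed here; all the data and settings below are purely illustrative):

```python
# A minimal sketch of training a small neural network classifier.
# Real deep learning systems use dedicated frameworks and far larger
# architectures, but the workflow is the same in spirit.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two hidden layers; deeper/wider architectures can learn richer features
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```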
Deep Learning has been quite popular lately, not just because of its innovative approach to analytics but primarily because of the value it adds to data science projects. In particular, deep learning systems are versatile and can be used across different domains, given sufficient data and enough diversity in that data. They aren't handy just for images, and newer areas of application are being discovered constantly. Additionally, deep learning systems can do without a lot of data engineering (e.g., feature engineering) since this is something they undertake themselves. In other words, they offer a shortcut of sorts for the data scientists who use them, making their projects more efficient. Finally, deep learning systems can be customized considerably, making them specialized for different domains. That's particularly useful for developing better models geared towards the specific data available to you.
Of course, the whole topic of deep learning is much deeper than all this. What's more, despite its usefulness, it's not always appropriate, since conventional machine learning is also quite relevant in data science today. Moreover, there are other AI-based systems usable in data science, such as those based on Fuzzy Logic. In any case, there is no one-size-fits-all solution, which is why it's better to be well-versed in the various options out there. A great place to start learning about these options in a hands-on way is my latest book, Julia for Machine Learning, where we tackle a variety of data science problems using various machine learning methods. Check it out when you have a moment!
More and more datasets these days contain sensitive data capable of identifying the people behind those ones and zeros. We usually refer to this kind of data as personally identifiable information, or PII for short. PII is a privacy concern for every data scientist or analyst working with such a dataset since if it leaks, we're all in trouble! Not just the data scientist, but the whole organization, especially if it's subject to privacy regulations like GDPR. Let's look into this matter in more detail.
First of all, PII-related privacy concerns are unavoidable in most real-world data science projects today. Chances are that at least some of the variables you deal with contain some type of sensitive data. These can be things like names, contact details, credit card numbers, and even health-related data (this latter kind of PII is particularly important since most of it cannot be changed, in contrast to a credit card). Even geo-location data often falls under the PII umbrella, though on its own it's not so sensitive, because it's hard to match it to a particular individual without using some other variable too.
This matching of particular variables to specific individuals is the source of all privacy-related problems. It's not so much the fact that some people's identities are compromised that's the issue (who cares if it becomes public that I enjoy a cup of coffee at the local coffee shop every morning?) but the fact that this data is supposedly protected. When it's out in the open, that's a breach of some privacy legislation, and the organization handling the data becomes liable to lawsuits. To make matters worse, if word gets out that a particular company doesn't protect its clients' sensitive data adequately, its reputation is bound to suffer, and its brand can be damaged. Not to mention that some of this PII can be traded on the black market, so if a malicious hacker gets hold of it, things can become even more challenging to manage.
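To give you a taste of what protecting such data can look like in practice, here's a minimal sketch of pseudonymizing PII fields via salted hashing in Python (the salt, field names, and record below are purely illustrative, and a real deployment would also need proper key and salt management):

```python
# A minimal sketch of pseudonymizing PII fields via salted hashing,
# one common way of protecting identifiers while keeping a dataset usable.
import hashlib

SALT = "replace-with-a-secret-salt"  # hypothetical secret, kept out of the data

def pseudonymize(value: str) -> str:
    """Return a salted SHA-256 digest standing in for the raw value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

record = {"name": "Jane Doe", "email": "jane@example.com", "amount": 42.0}
safe_record = {k: pseudonymize(v) if k in ("name", "email") else v
               for k, v in record.items()}
print(safe_record)  # PII fields replaced by opaque digests
```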
To avoid these problems, we need to handle PII properly. You can do this in various ways, some of which we're going to explore in future articles. As I've lately delved more into Cybersecurity and Privacy, I can provide a better perspective on this subject, which ties into data science work quite practically. However, should you wish to delve into this topic a bit now, you can check out my latest video course on WintellectNow, titled Privacy Fundamentals. There I cover various practical ways of securing privacy in your personal and professional life. It's not data science-focused, but it can help you cultivate the right mindset that will enable you to handle PII more responsibly. Stay tuned for more material in the coming months. Cheers!
A-B testing plays a crucial role in traditional science as well as data science. It's hard to imagine a worthwhile scientific experiment without some form of A-B testing. It's such a useful technique that it features heavily in data analytics too. In this article, we'll explore this essential method of data analysis, focusing on its role in scientific work and data science.
In a nutshell, A-B testing uses data analysis to determine if two different samples are significantly different from each other, with respect to a given variable. The latter is usually a continuous variable used to examine how different the two samples are (though it can be nominal too). The two samples often derive from a partitioning of a dataset based on another variable, which is binary. A-B testing is closely linked to Statistics, although any heuristic could be used to evaluate the difference between the two samples. Still, since Statistics yields a measurable and easy-to-interpret result in the form of a probability (the p-value), it's often the case that particular statistical tests are used for A-B testing.
A-B testing is used heavily in scientific work. The reason is simple: since there are several hypotheses the analyst considers, it's often the case that the best way to test many of these hypotheses is through A-B testing. After all, this methodology is closely linked to the formation of a hypothesis and its testing, based on the data at hand. Naturally, the usefulness of A-B testing is also apparent in data science and data analytics during the data exploration stage.
The statistical tests used for A-B testing are t-tests, chi-square tests, and to a lesser extent, z-tests. The t-test handles cases where a continuous variable is involved (e.g., Sales), while the chi-square test is geared towards nominal variables. Z-tests are very much like t-tests, but they make stronger assumptions about the data (e.g., a known population variance), which is why they're used less often. All these tests yield a p-value as a result, which is compared to a predefined threshold (alpha) taking values like 0.05, 0.01, or 0.001. The lower the p-value, the more significant the result. Having a p-value lower than the alpha value means that you can safely reject the Null Hypothesis (which states that any differences between the two samples are due to chance).
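Here's a minimal sketch of both kinds of tests in Python, using SciPy on simulated data (the samples and the contingency table below are made up purely for illustration):

```python
# A minimal sketch of an A-B test on two samples using SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100.0, scale=15.0, size=200)  # e.g., Sales under variant A
group_b = rng.normal(loc=105.0, scale=15.0, size=200)  # e.g., Sales under variant B

# t-test for a continuous variable
t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05
print(f"t-test p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the Null Hypothesis: the samples differ significantly.")
else:
    print("Fail to reject the Null Hypothesis.")

# chi-square test for a nominal variable (e.g., converted vs. not)
table = np.array([[120, 80], [95, 105]])  # contingency table: variant x outcome
chi2, p_nominal, dof, expected = stats.chi2_contingency(table)
print(f"chi-square p-value: {p_nominal:.4f}")
```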
Note that A-B testing is a deep topic, and it's hard to do it justice in a blog article. Also, it requires a lot of practice to understand it thoroughly. So, if it sounds a bit abstract, that's normal, especially if you are new to Statistics. Cheers!
The data scientist and the data analyst both deal with data analysis as their primary task, yet those two roles differ enough to warrant an entirely different set of expectations for each. Both share common attributes and skills, however, making them more similar than people think. This similarity allows a relatively more straightforward transition from one role to another, if needed, something not everyone realizes. This article explores the details of this situation and makes some suggestions as to how each role can benefit the other.
The two roles are surprisingly similar, in ways going beyond the surface kinship (i.e., data analysis). Data scientists and data analysts deal with all kinds of data (even though text data is not standard among data analysts), often directly from databases. So, they both use SQL (or some SQL-like language) to access a database and obtain the data needed for the project at hand. Both kinds of professionals deal with cleaning and formatting the data to some extent, be it in a programming language (e.g., Python or Julia) or some specialized software (e.g., a spreadsheet program, in the case of data analysts). Also, both data scientists and data analysts create visuals and put together presentations containing these graphics. Finally, both kinds of professionals write reports or some form of documentation for their work and share it with the project's appropriate stakeholders.
Despite the sophistication of our field, we data scientists can learn some things from data analysts. Particularly the new generation of data scientists, coming out of bootcamps or from a programming background, has a lot to gain from these professionals. Namely, data analysts are closer to the business side of things and often have domain knowledge that data scientists don't. After all, data analysts are more versatile in terms of employability, making them more likely to gather experience in different domains. Also, data analysts tend to have more developed soft skills, particularly communication, as they have more opportunities to hone them. Learning all that can benefit any data scientist, especially those who are new to the field.
Additionally, data analysts can learn from data science professionals too. Specifically, the in-depth analysis that we do as data scientists is something every analyst can undoubtedly benefit from. In particular, data engineering is the kind of work that adds a lot of value to data science projects (when it's done right) and something we don't see that much in data analytics ones. What's more, predictive modeling (e.g., using modern frameworks, such as machine learning) is found mostly in data science, yet it's something a data analyst can apply too. Once someone has the right mindset (aka the data science mindset), it's not too difficult to pick up these skills, particularly if they are already versed in data analytics.
If you wish to learn more about the soft skills and business-related aspects of data science, you can check out one of my relatively recent books, Data Scientist Bedside Manner. In this book, my co-author and I look into the organizations hiring data scientists, the relevant expectations, and how such a professional can work effectively and efficiently within an organization. So, check it out if you haven't already. Cheers!
Robotic Process Automation, or RPA for short, is a methodology involving the automation of specific data-related processes through specialized scripts. RPA usually applies to low-level, monotonous, and easy-to-automate tasks, though lately, it has expanded into more high-level tasks. RPA is quite popular because it saves money, but it has its perils when used in excess. Whatever the case, it dramatically impacts our field, so it's essential to know about it. This article is all about that.
Let's start by looking at the usefulness of RPA. RPA is useful in extract, transform, and load (ETL) operations of all kinds. This kind of work involves moving the data from A to B, making changes to it, and making it available to other project stakeholders. Even in data science, there is a lot of ETL work involved, and often the data scientist is expected to undertake it. Naturally, RPA in ETL is very useful as it saves us time, which we can spend on more challenging tasks, such as picking, training, and refining a model. Also, many data engineering tasks require a lot of time, so if they can be automated to some extent through RPA, the whole project becomes more efficient.
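As a concrete illustration, here's a minimal sketch of the kind of scripted ETL step that RPA automates, written in Python with pandas (the file names, column names, and conversion factor are all hypothetical):

```python
# A minimal sketch of a scripted ETL step: extract from a CSV file,
# transform the data, and load the result into another file.
import pandas as pd

df = pd.read_csv("raw_sales.csv")          # Extract (hypothetical source file)
df = df.dropna(subset=["amount"])          # Transform: drop incomplete rows
df["amount_usd"] = df["amount"] * 1.1      # Transform: hypothetical conversion
df.to_csv("clean_sales.csv", index=False)  # Load: hand off to stakeholders
```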
The danger of RPA in excess becomes apparent when we try to automate the whole data science pipeline or even just individual parts of it. Take data engineering, for example. If we were to automate the whole thing, we'd be left with no say in what variables are worth looking into and what the data has to tell us. All initiative from the data scientist would be gone, and the whole project would become mechanical and even meaningless. Some insights might still come about, but the project wouldn't be as powerful. The same goes for other parts of the data science workflow.
RPA is also used in specialized frameworks for data modeling, the part of the pipeline that follows data engineering. AutoML systems, for example, attempt to apply RPA to data modeling through various machine learning models. Although this may have some advantages over traditional approaches, in the long run, it may rob data scientists of their expertise and of the personal touch of the models developed. After all, if everything becomes automated in data science, what's the point of having a human being in that role? Maybe certain occupations can be outsourced to machines, but it's not clear how outsourcing even the more high-level work to them will benefit the field as a whole.
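To give a toy flavor of this kind of automation, here's a sketch in Python that tries a few scikit-learn models and hyperparameters and keeps the best one, with no human in the loop (actual AutoML systems are far more sophisticated; the models, grids, and data here are just illustrative):

```python
# A toy flavor of automated model selection, loosely in the spirit of
# AutoML: try a few candidate models and hyperparameters, keep the best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

best_score, best_model = -1.0, None
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=5).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, f"cv accuracy: {best_score:.3f}")
```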
The good news is that even though RPA can undertake many aspects of data science work, it cannot replicate the data science mindset. This frame of mind governs how we work with the data and why. It's what makes our work worth paying for and what brings about real value from it. You can learn more about the data science mindset and other aspects of data science work from my book Data Science Mindset, Methodologies, and Misconceptions. Feel free to check it out when you have some time. Cheers!
Not to be confused with graphics (visuals), graphs are a data structure used in data science work. This data analysis approach is quite popular today, yet it's not covered adequately in the data science literature. The fact that it's a fairly advanced topic may have something to do with that. In any case, graphs are a powerful tool as they capture information without worrying about dimensionality, which, although manageable in most cases, can be challenging to overcome in more complex problems. This article will explore graphs, the theory behind them, their applications, and how they are stored.
Let's start by taking a look at graphs and the theory behind them. First of all, a graph is a representation of information in the form of a graphical structure consisting of nodes and arcs. Nodes represent entities of interest (e.g., people in a social media network), and arcs are the connections among these entities (e.g., an online friendship between them). The nodes and the arcs may have additional metadata attached to them, such as attributes (the name of a person, their date of birth, the duration of each friendship connection, etc.). The attribute that depicts the "strength" or "length" of an arc is referred to as its weight, and it's crucial for various graph algorithms. So, a graph can formally represent the relationships of a set of entities, along with any supplementary information involved. The mathematical framework covering graphs, along with a series of useful theorems and heuristics, is referred to as Graph Theory.
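Here's a minimal sketch of such a structure in Python, using the NetworkX library (the people, attributes, and weights below are made up):

```python
# A minimal sketch of the graph structure described above: nodes for
# people, weighted arcs for friendships, plus attribute metadata.
import networkx as nx

G = nx.Graph()
G.add_node("Alice", born=1990)             # node with attribute metadata
G.add_node("Bob", born=1985)
G.add_node("Carol", born=1992)
G.add_edge("Alice", "Bob", weight=0.9)     # arc weight = friendship strength
G.add_edge("Bob", "Carol", weight=0.4)

# A classic graph algorithm that relies on arc weights
path = nx.shortest_path(G, "Alice", "Carol", weight="weight")
print(path)  # ['Alice', 'Bob', 'Carol']
```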
Graphs have lots of useful applications in data science work. When they are used for data analysis, they go by the term Graph Analytics. This kind of analytics is an integral part of data science, although it's not always essential for the average data science project. For example, if you care about analyzing a complicated situation, like the logistics of an organization, graphs may come in very handy. However, if you care about predicting the next quarter's sales, graphs may be overkill, plus the problem would be solved more easily through a time series regression model. Also, for problems involving lots of dimensions, graphs can be handy because they can model the data points' relationships through a similarity metric, used as the weights of the various arcs in the graph.
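As a sketch of that last use case, here's how one might turn pairwise similarities among high-dimensional data points into arc weights, using Python with NetworkX and scikit-learn (the data and the similarity threshold are arbitrary, for illustration only):

```python
# A sketch of using pairwise similarities among high-dimensional data
# points as arc weights, sidestepping the dimensionality of the data.
import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.default_rng(0).normal(size=(5, 50))  # 5 points, 50 dimensions
S = cosine_similarity(X)

G = nx.Graph()
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        if S[i, j] > 0.0:                # keep only sufficiently similar pairs
            G.add_edge(i, j, weight=S[i, j])

print(G.number_of_edges(), "weighted arcs built from similarities")
```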
But where do we store these elaborate data structures? Fortunately, there are specialized databases for this task, aka graph databases. This kind of data storage and retrieval system allows for efficient encoding of the graphs and some useful operations with them. Neo4j is a popular such database, which also functions as a graph visualization tool. However, there are other, more refined ones that can also manage enormous datasets. Most modern graph databases today are NoSQL databases that can handle other kinds of datasets, not just graph-based ones. A well-known such database is ArangoDB.
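For illustration, here's a minimal sketch of querying such a database from Python using the official Neo4j driver (this assumes a running Neo4j instance; the URI, credentials, and data model here are entirely hypothetical):

```python
# A minimal sketch of querying a graph database from Python using the
# official Neo4j driver. Requires a running Neo4j instance.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # hypothetical credentials

with driver.session() as session:
    # Cypher query: find the friends of a given person
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIEND]->(f:Person) "
        "RETURN f.name AS friend", name="Alice")
    for record in result:
        print(record["friend"])

driver.close()
```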
If you are interested in learning more about this and other useful data science methodologies, check out my book Data Science Mindset, Methodologies, and Misconceptions. In this manuscript, I delve into all kinds of methodologies related to data science work, including graphs and NoSQL databases. However, the focus is on the data science mindset, which is essential for any data scientist, as well as for remaining relevant in this field. Check it out when you have a moment. Cheers!
The data scientist role is an incredibly important one in the world today. Be it in for-profit organizations or non-profit ones, it has a lot of value to add, aiding decision-making along the way. However, it's still unclear to many what exactly the role entails and how someone can become a data scientist starting from a data analytics background.
The data scientist is a tech professional who processes data, especially complex data in large amounts (aka big data), to derive insights and build data products. This role involves gathering data, cleaning it up, combining it with other relevant data, evaluating the features involved, and building models based on them, usually to predict some variable of interest or solve some complex problem. It also involves creating insightful visuals and presenting your findings to the project stakeholders, with whom you often need to liaise throughout a data science project. For all this work, you need to use a lot of programming and various data analysis methods, particularly machine learning.
To transition to the data scientist role from the data analyst one, you need to beef up your programming skills and work on your data analysis methodologies. Learning more techniques for pre-processing data (data engineering) is also essential. What's more, you need to familiarize yourself with various methods for depicting data, such as graphs, and with how to process the data in this sort of encoding. Dimensionality reduction methods are also vital for assuming the data scientist role, just like various sampling techniques. Furthermore, handling data in different formats (e.g., JSON, XML, and text) is essential, particularly in projects that deal with semi-structured data. Naturally, having some familiarity with NoSQL databases is also very important, as it goes hand-in-hand with this sort of data.
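As a small taste of the semi-structured data part, here's a minimal sketch of handling JSON and XML payloads with Python's standard library (the payloads below are made up):

```python
# A minimal sketch of handling semi-structured data (JSON and XML)
# using only Python's standard library.
import json
import xml.etree.ElementTree as ET

json_payload = '{"user": {"name": "Jane", "purchases": [12.5, 30.0]}}'
data = json.loads(json_payload)
print(sum(data["user"]["purchases"]))  # 42.5

xml_payload = "<users><user name='Jane'/><user name='John'/></users>"
root = ET.fromstring(xml_payload)
print([u.get("name") for u in root.findall("user")])  # ['Jane', 'John']
```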
Naturally, all this is the tip of the iceberg when it comes to transitioning into a data scientist from a data analyst. To make sure this transition is solid enough to build a career on top of it, you need to develop other skills and a good understanding of the complex data involved in data science projects. Being able to communicate well with other data professionals and understand them is also very important. Nowadays, you often have to work as part of a data science team, which involves a certain level of specialization. So, having such expertise is important, at least for certain data scientist positions.
You can learn more about this topic by reading my first book on data science, namely Data Scientist: The Ultimate Guide to Becoming a Data Scientist. This book covers various topics related to the data scientist role and has a whole section dedicated to similar roles. It is also written in an easy-to-follow way, without too much technical jargon, and it has a glossary at the end. Interviews with data scientists of various levels help clarify the role's details and what it looks like on a practical level. So, check it out when you have a moment. Cheers!
Data analytics and data science are both about finding useful (preferably actionable) insights in the data and helping with the decisions involved. In data science, this typically involves predictive models, usually in the form of data products, while the analysis involved is more in-depth and entails more sophisticated methods. But what tools do data analysts and data scientists use?
First of all, we need to examine the essential tasks that these data professionals undertake. For starters, they usually gather data from various sources and organize it into a dataset. This is usually followed by cleaning the data up to some extent. Data cleaning often involves handling missing values and ensuring that the data has some structure to it afterward. Another task involves exploring the dataset and figuring out which variables are the most useful. Of course, this depends on the objective, which is why data exploration often involves some creativity. The most important task, however, is building a model and doing something useful with the data. Finally, creating interesting plots is another task that is useful throughout the data analyst/scientist's workflow.
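To make the cleaning task more concrete, here's a minimal sketch of handling missing values in Python with pandas (the data is made up):

```python
# A minimal sketch of handling missing values in a small DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [50000, 62000, np.nan, 58000]})

df["age"] = df["age"].fillna(df["age"].median())  # impute with the median
df = df.dropna(subset=["income"])                 # or drop incomplete rows
print(df)
```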
You can do all the tasks mentioned earlier using a variety of tools. More specifically, for all data acquisition tasks, a SQL-based piece of software is used (e.g., PostgreSQL). This sort of software uses the SQL language for accessing and querying structured databases so that the most relevant data is gathered. For data science work, NoSQL databases are also often used, along with their corresponding software. As for the data analytics tasks, a program like MS Excel is used by most data analysts, while data scientists rely on a programming language like Python or Julia. Data analysts use Tableau or some similar application for data visualization tasks, while data scientists employ a graphics library or two in the programming language they use. For all other tasks (e.g., putting things together), both kinds of data professionals use a programming language like Python or Julia.
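As an illustration of the data acquisition part, here's a minimal sketch of querying a relational database with SQL from Python (an in-memory SQLite database stands in for something like PostgreSQL here, and the table is made up):

```python
# A minimal sketch of SQL-based data acquisition from Python, using
# an in-memory SQLite database as a stand-in for a production database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 95.5), ("North", 80.0)])

for row in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)  # ('North', 200.0), ('South', 95.5)
conn.close()
```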
Naturally, all of the tools mentioned previously have evolved and continue to evolve. The fundamental functionality may remain about the same, but new features and changes to the existing features are commonplace. For example, new libraries keep coming about in programming languages, expanding the usefulness of the languages while creating more auxiliary tools for various data-related tasks. No matter how these tools evolve, the one thing that doesn't change is the mindset behind all the data analytics/science work. This mindset is the driving force behind all such work, and it's what enables you to use whatever tools you have at your disposal to make something useful out of the data at hand. The mindset is closely related to a solid understanding of the methodologies involved in data analytics/science.
For more information on this subject, particularly the mindset part, check out my book Data Science Mindset, Methodologies, and Misconceptions. In it, I cover this topic in sufficient depth, and even though it is geared towards data science professionals, it can be useful to data analysts and other data professionals as well. Cheers!
Ensembles are sophisticated machine learning models composed of other, simpler models. They are quite useful when accuracy is the fundamental requirement and computational resources are not a severe limitation. Ensembles are trendy in all sorts of current projects as they have significant advantages over conventional models. Let's take a closer look at this through this not-too-technical article.
Ensembles are an essential part of modern data science, as they are more robust and powerful than individual models. Additionally, ensembles are ideal in cases of complex problems (something increasingly common in data science), as they can provide better generalization and more stability. In cases where conventional models fail to provide decent results, ensembles tend to work well enough to justify using data science for the problem at hand. That's why data scientists usually turn to them after all attempts to solve the problem with conventional models have failed.
The most common kind of ensemble is the one based on the decision tree model. The main reason is that this kind of model is relatively fast to build and train, while it yields reasonably good performance. What's more, it's easy to interpret, something that carries over to the ensemble itself, making it a relatively transparent model. Other ensembles are based on the combination of different models belonging to different families. This heterogeneous architecture enables the ensemble to be more diverse in processing the data and to yield better performance. For any ensemble, there needs to be a heuristic in place to figure out how the different outputs of the models comprising the ensemble are fused. The most straightforward such heuristic is majority voting, which works quite well when the ensemble's models are comparable.
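Here's a minimal sketch of both flavors in Python with scikit-learn: a tree-based ensemble (Random Forest) and a heterogeneous one fused via majority voting (the dataset is synthetic, purely for illustration):

```python
# A minimal sketch of two ensemble flavors: a tree-based ensemble
# (Random Forest) and a heterogeneous one fused via majority voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=800, n_features=15, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1)

# Heterogeneous ensemble: different model families, majority ("hard") voting
voter = VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                          ("nb", GaussianNB()),
                          ("rf", forest)], voting="hard")

for name, model in [("Random Forest", forest), ("Voting ensemble", voter)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```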
The main drawback of ensembles is a particular compromise you have to make when using them. Namely, transparency greatly diminishes, and in some cases, it disappears altogether. This can be an issue when you need the model you've built to be somewhat interpretable. Additionally, ensembles can overfit if the dataset isn't large enough or if they comprise a large number of models. That's why they require special care and are better off handled by experienced data scientists. Finally, ensembles require more computational resources than most data models. As a result, you need a good reason for using them in a large-scale scenario, since they can be quite costly.
You can learn more about ensembles and other machine learning topics in one of my recent books. Namely, in the Julia for Machine Learning book, published last year, I extensively cover ensembles and several other related topics. Using Julia as the language through which the various machine learning models and methods are implemented, I examine how all this can be useful in handling a data science project robustly. Check it out when you have a moment. Cheers!
Questions are an integral part of scientific work, especially where research is concerned. Without questions, we can't have experiments, nor the capability of testing our observations against our current understanding of the world. Naturally, this applies to all aspects of science, including one of its most modern expressions: data science. This article will explore the value of questions in science, how they relate to hypotheses, and how data science comes into the picture.
Questions are what we ask to figure out how things fit in the puzzle of scientific work. They can be general or specific (the latter being more common), and they are always linked to hypotheses in one way or another. The latter are the formal expressions of questions and usually take a true or false label, based on the tests involved. Tests are related to experimentation, whereby we see how the evidence (data) accumulated fits the assumptions we make. Without questions and hypotheses, we cannot have scientific theories and anything else useful in this paradigm of thought. Note that all this is possible because we are open to being wrong and are genuinely curious to find out what's going on in the area we are investigating.
Questions and hypotheses are also vital because they are a crucial part of the scientific method, the cornerstone of science. Since its inception a few centuries ago, the scientific method has laid the framework for scientific work, particularly in forming new theories and developing a solid understanding of the world. It's closely tied to experimentation since, at its core, science is empirical and relies on observable data rather than predefined ideas. The latter approach belongs to the realm of philosophy; although it's not as prominent as it used to be, it still has a role in science. The scientific method is precise and relies on a disciplined methodology, but it's also open to creativity. After all, not all questions are deterministic and predictable; some of them may stem from a deeper understanding of the world, guided by intuition.
But how does all this relate to data science? First of all, data science is the application of scientific methodologies to problems beyond scientific research. In this way, it's a broader application of scientific principles and the scientific method, involving data from all domains. Because of this, it's crucial to think about questions and try to answer them as systematically as we can, using the various data analysis methodologies at our disposal. Not all of these methodologies are as clear-cut and easy as Statistics. Still, all of them involve models that try to describe or predict what would happen when new data comes into play, something fundamental in scientific work.
If you wish to learn more about this topic and data science's application of the scientific method, check out my latest book, Doing Science Using Data Science. In this book, my co-author and I explore various science-related topics, emphasizing the practical application of data science methodologies and how these ideas are implemented in practice. The book is accompanied by a series of Jupyter notebooks, using Julia and Python. Check it out when you have a moment. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.