A-B testing plays a crucial role in traditional science as well as data science. It isn't easy to imagine a scientific experiment worth its time without A-B testing. It's such a useful technique that it features heavily in data analytics too. In this article, we'll explore this essential method of data analysis, focusing on its role in scientific work and data science.
In a nutshell, A-B testing uses data analysis to determine whether two samples differ significantly from each other with respect to a given variable. This variable is usually continuous, though it can be nominal too, and it serves as the measure of how different the two samples are. The two samples often derive from partitioning a dataset on another variable, which is binary. A-B testing is closely linked to Statistics, although any heuristic could be used to evaluate the difference between the two samples. Still, since Statistics yields a measurable and easy-to-interpret result in the form of a probability (p-value), particular statistical tests are usually the tools of choice for A-B testing.
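To make the "any heuristic" point a bit more concrete, here is a minimal sketch of a permutation test on the difference of means, written in plain Python. The function name, the two made-up samples, and the number of shuffles are all assumptions for illustration; this is just one possible way to compare two samples without a formal statistical test.

```python
import random
from statistics import mean

def permutation_test(sample_a, sample_b, n_shuffles=5_000, seed=42):
    """Estimate a two-sided p-value for the difference in means of two samples."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_shuffles):
        # Reassign the pooled values to the two groups at random
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate from being exactly zero
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical measurements for the two partitions (A and B) of a dataset
group_a = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9, 12.7, 12.4]
group_b = [13.4, 13.9, 13.1, 14.2, 13.6, 13.8, 13.3, 14.0]

p_value = permutation_test(group_a, group_b)
print(f"Estimated p-value: {p_value:.4f}")
```

The idea is that if the A/B label didn't matter, shuffling the labels would produce differences as large as the observed one fairly often. The fraction of shuffles where that happens plays the role of the p-value.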
A-B testing is used heavily in scientific work. The reason is simple: an analyst usually considers several hypotheses, and the best way to test many of them is often through A-B testing. After all, this methodology is closely linked to the formation of a hypothesis and its testing, based on the data at hand. Naturally, the usefulness of A-B testing is also apparent in data science and data analytics during the data exploration stage.
The statistical tests used for A-B testing are t-tests, chi-square tests, and, to a lesser extent, z-tests. The t-test handles cases where a continuous variable is involved (e.g., Sales), while the chi-square one is geared towards nominal variables. Z-tests are very much like t-tests, but they make stronger assumptions about the data (typically a known population variance or a large sample), so they apply in fewer situations. All these tests yield a p-value as a result, which is compared to a predefined threshold (alpha), taking values like 0.05, 0.01, or 0.001. The lower the p-value, the more significant the result. Having a p-value lower than the alpha value means that you can reject the Null Hypothesis (which states that any differences between the two samples are due to chance). Note that rejecting it is not the same as disproving it: there is still a chance of up to alpha of rejecting a Null Hypothesis that is actually true.
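As a rough illustration of the p-value versus alpha comparison, the sketch below runs a two-sample z-test using only Python's standard library. I chose the z-test here simply because it needs nothing beyond the statistics module; for small samples a t-test (for example, scipy.stats.ttest_ind in SciPy) would be the more usual choice. The function name and the simulated Sales figures are assumptions for illustration, and the sample variances stand in for the population ones, which is only reasonable for large samples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(sample_a, sample_b):
    """Return (z statistic, two-sided p-value) for the difference in means.

    Large-sample approximation: sample variances replace population variances.
    """
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical Sales figures for the two groups (simulated, not real data)
rng = random.Random(0)
sales_a = [rng.gauss(100, 15) for _ in range(200)]
sales_b = [rng.gauss(106, 15) for _ in range(200)]

alpha = 0.05  # predefined significance threshold
z, p = two_sample_z_test(sales_a, sales_b)
decision = "reject" if p < alpha else "fail to reject"
print(f"z = {z:.2f}, p-value = {p:.4f}: {decision} the Null Hypothesis")
```

Changing alpha to 0.01 or 0.001 makes the test stricter, so the same p-value may no longer count as significant.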
Note that A-B testing is a deep topic, and it's hard to do it justice in a blog article. Also, it requires a lot of practice to understand it thoroughly. So, if it sounds a bit abstract, that's normal, especially if you are new to Statistics. Cheers!
The data scientist and the data analyst both deal with data analysis as their primary task, yet those two roles differ enough to warrant an entirely different set of expectations for each. Both share common attributes and skills, however, making them more similar than people think. This similarity allows a relatively more straightforward transition from one role to another, if needed, something not everyone realizes. This article explores this situation's details and makes some suggestions as to how each role can benefit the other.
The two roles are surprisingly similar, in ways going beyond the surface kinship (i.e., data analysis). Data scientists and data analysts deal with all kinds of data (even though text data is not standard among data analysts), often directly from databases. So, they both deal with SQL (or some SQL-like language) to access a database and obtain the data needed for the project at hand. Both kinds of professionals deal with cleaning and formatting the data to some extent, be it in a programming language (e.g., Python or Julia), or some specialized software (e.g., a Spreadsheet program, in the case of data analysts). Also, both data scientists and data analysts deal with visuals and presentations containing these graphics. Finally, both kinds of professionals write reports or some form of documentation for their work and share it with the project's appropriate stakeholders.
Despite the sophistication of our field, we data scientists can learn some things from data analysts. The new generation of data scientists in particular, coming out of bootcamps or from a programming background, has a lot to gain from these professionals. Namely, data analysts are closer to the business side of things and often have domain knowledge that data scientists don't. After all, data analysts are employable in a wider range of settings, which makes them more likely to gather experience in different domains. Also, data analysts tend to have more developed soft skills, particularly communication, as they have more opportunities to hone them. Learning all that can benefit any data scientist, especially those who are new to the field.
Additionally, data analysts can learn from data science professionals too. Specifically, the in-depth analysis that we do as data scientists is something every analyst can undoubtedly benefit from. In particular, data engineering is the kind of work that adds a lot of value in data science projects (when it's done right) and something we don't see that much in data analytics ones. What's more, predictive modeling (e.g., using modern frameworks, such as machine learning) is found mostly in data science, yet it is something a data analyst can apply too. Once someone has the right mindset (aka the data science mindset), it's not too difficult to pick up those skills, particularly if they are already versed in data analytics.
If you wish to learn more about the soft skills and business-related aspects of data science, you can check out one of my relatively recent books, Data Scientist Bedside Manner. In this book, my co-author and I look into the organization hiring data scientists, the relevant expectations, and how such a professional can work effectively and efficiently within an organization. So, check it out if you haven't already. Cheers!
Robotic Process Automation, or RPA for short, is a methodology involving the automation of specific data-related processes through specialized scripts. RPA usually applies to low-level, monotonous, and easy-to-automate tasks, though lately it has expanded into more high-level tasks too. RPA is quite popular for saving money, but it has its perils when used in excess. Whatever the case, it dramatically impacts our field, so it's essential to know about it. This article is all about that.
Let's start by looking at the usefulness of RPA. RPA is useful in the extract, transform, and load (ETL) operations of all kinds. This kind of work involves moving the data from A to B, making changes to it, and making it available to other project stakeholders. Even in data science, there is a lot of ETL work involved, and often the data scientist is expected to undertake it. Naturally, RPA in ETL is very useful as it saves us time, time which we can spend on more challenging tasks, such as picking, training, and refining a model. Also, many data engineering tasks require a lot of time, so if they can be automated to some extent through RPA, it can make the whole project more efficient.
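To give a feel for the kind of ETL step that lends itself to automation, here is a minimal sketch in plain Python. It is not tied to any particular RPA tool; the CSV content, the table name, and the cleaning rules are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API
RAW_CSV = """order_id,customer,amount
1, Alice ,120.50
2,bob,80
3,Carol,
4,dave,42.10
"""

def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy up names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip records with a missing amount
        cleaned.append((int(row["order_id"]),
                        row["customer"].strip().title(),
                        float(amount)))
    return cleaned

def load(rows, conn):
    """Load: store the cleaned rows in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
load(transform(extract(RAW_CSV)), conn)

order_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
customers = [r[0] for r in conn.execute(
    "SELECT customer FROM orders ORDER BY order_id")]
print(order_count, round(total_amount, 2), customers)
```

Once a script like this is scheduled to run on its own, the data scientist no longer needs to repeat these steps by hand, which is where the time savings come from.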
The danger of RPA when in excess becomes apparent when we try to automate the whole data science pipeline or even entire parts of it. Take data engineering, for example. If we were to automate the whole thing, we'd be left with no say in what variables are worth looking into and what the data has to tell us. All initiative on the data scientist's part would be gone, and the whole project would be mechanical and even meaningless. Some insights might still come about, but the project wouldn't be as powerful. The same goes for other parts of the data science workflow.
RPA is also used in specialized frameworks for data modeling, the part of the pipeline that follows data engineering. Systems like AutoML, for example, attempt to apply RPA in data modeling through various machine learning models. Although this may have some advantages over traditional approaches, in the long run, it may rob data scientists of their expertise and the models of their personal touch. After all, if everything becomes automated in data science, what's the point of having a human being in that role? Maybe certain occupations can be outsourced to machines, but it's not clear how outsourcing even the more high-level work to them will benefit the whole.
The good news is that even though RPA can undertake many aspects of data science work, it cannot replicate the data science mindset. This frame of mind is in charge of how we work the data and why. It's what makes our work worth paying for and what brings about real value from it. You can learn more about the data science mindset and other aspects of data science work from my book Data Science Mindset, Methodologies, and Misconceptions. Feel free to check it out when you have some time. Cheers!
Not to be confused with graphics (visuals), graphs are a data structure used in data science work. This data analysis approach is quite popular today, yet it's not covered adequately in the data science literature. The fact that it's a fairly advanced topic may have something to do with that. In any case, graphs are a powerful tool as they capture information without worrying about dimensionality, which, although manageable in most cases, can be challenging to overcome in more complex problems. This article will explore graphs, the theory behind them, their applications, and how they are stored.
Let's start by taking a look at graphs and the theory behind them. First of all, a graph is a representation of information in the form of a graphical structure consisting of nodes and arcs. Nodes represent entities of interest (e.g., people in a social media network), and arcs are the connections among these entities (e.g., an online friendship between them). The nodes and the arcs may have additional metadata attached to them, such as attributes (name of that person, date of birth, the duration of each friendship connection, etc.). The attribute that depicts the "strength" or "length" of an arc is referred to as its weight, and it's crucial for various graph algorithms. So, a graph can formally represent the relationships of a set of entities, along with any supplementary information involved. The mathematical framework of graphs and a series of useful theorems and heuristics are referred to as Graph Theory.
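The nodes, arcs, and weights described above can be sketched in a few lines of Python. The example below stores a small weighted graph as a dictionary of dictionaries and runs Dijkstra's shortest-path algorithm on it, as one instance of a graph algorithm that relies on the weights. The names and weight values are hypothetical, and this representation is just one of several possible ones.

```python
import heapq

# Each node maps to its neighbors; the value on each arc is its weight
# (hypothetical people, with the weight read as a "length" or distance)
graph = {
    "Ann": {"Bob": 1.0, "Cat": 4.0},
    "Bob": {"Ann": 1.0, "Cat": 2.0, "Dan": 5.0},
    "Cat": {"Ann": 4.0, "Bob": 2.0, "Dan": 1.0},
    "Dan": {"Bob": 5.0, "Cat": 1.0},
}

def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest total weight from source to each node."""
    distances = {source: 0.0}
    queue = [(0.0, source)]  # priority queue of (distance so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale entry; a shorter route was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances

distances_from_ann = shortest_distances(graph, "Ann")
print(distances_from_ann)
```

Note how the direct Ann-Cat arc (weight 4.0) loses to the route through Bob (1.0 + 2.0), which is exactly the kind of insight the weights make possible.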
Graphs have lots of useful applications in data science work. When they are used for data analysis, they go by the term Graph Analytics. This kind of analytics is an integral part of data science, although it's not always essential for the average data science project. For example, if you care about analyzing a complicated situation, like the logistics of an organization, graphs may come in very handy. However, if you care about predicting the next quarter's sales, graphs may be overkill, plus the problem would be more easily solved through a time series regression model. Also, for problems involving lots of dimensions, graphs can be handy because they can model the data points' relationships through a similarity metric that they can use as the set of weights of the various arcs in a graph.
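As a sketch of how a similarity metric can supply the weights of a graph, the snippet below links data points whose cosine similarity exceeds a threshold. The choice of cosine similarity, the threshold value, and the four sample points are all assumptions for illustration; other metrics would work just as well.

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_graph(points, threshold=0.9):
    """Connect every pair of points whose similarity reaches the threshold."""
    graph = {name: {} for name in points}
    for a, b in combinations(points, 2):
        similarity = cosine_similarity(points[a], points[b])
        if similarity >= threshold:
            # The similarity value becomes the weight of the arc
            graph[a][b] = similarity
            graph[b][a] = similarity
    return graph

# Hypothetical data points; real ones could have hundreds of dimensions
points = {
    "p1": [1.0, 0.0, 2.0, 0.5],
    "p2": [0.9, 0.1, 2.1, 0.4],
    "p3": [0.0, 3.0, 0.1, 0.0],
    "p4": [0.1, 2.8, 0.0, 0.2],
}

sim_graph = similarity_graph(points)
for node, neighbors in sim_graph.items():
    print(node, "->", sorted(neighbors))
```

Whatever the number of dimensions, each pair of points collapses into a single weight, which is why the dimensionality matters less once the data is in graph form.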
But where do we store these elaborate data structures? Fortunately, there are specialized databases for this task, aka graph databases. This kind of data storage and retrieval system allows for efficient encoding of the graphs and some useful operations with them. Neo4j is a popular such database, and it also offers graph visualization tools. However, there are other, more refined ones that can also manage enormous datasets. Many modern graph databases are NoSQL databases that can handle other kinds of datasets, not just graph-based ones. A well-known such database is ArangoDB.
If you are interested in learning more about this and other useful data science methodologies, check out my book Data Science Mindset, Methodologies, and Misconceptions. In this manuscript, I delve into all kinds of methodologies related to data science work, including graphs and NoSQL databases. However, the focus is on the data science mindset, which is essential for any data scientist as well as remaining relevant in this field. Check it out when you have a moment. Cheers!
The data scientist role is an incredibly important one in the world today. Be it in for-profit organizations or non-profit ones, it has a lot of value to add and aid decision-making. However, it's still unclear what exactly it entails and how someone can become a data scientist starting from a data analytics background.
The data scientist is a tech professional who processes data, especially complex data, in large amounts (aka big data) to derive insights and build data products. This role involves gathering data, cleaning it up, combining it with other relevant data, evaluating the features involved, and building models based on them, usually to predict some variable of interest or solve some complex problem. It also involves creating insightful visuals and presenting your findings to the project stakeholders, with whom you often need to liaise throughout the data science projects. For all this work, you need to use a lot of programming and various data analysis methods, particularly machine learning.
To transition to the data scientist role from the data analyst one, you need to beef up your programming skills and work on your data analysis methodologies. Learning more techniques for pre-processing data (data engineering) is also essential. What's more, you need to familiarize yourself with various methods for depicting data, such as graphs, and with how to process data in this sort of encoding. Dimensionality reduction methods are also vital for assuming the data scientist role, just like various sampling techniques. Furthermore, handling data in different formats (e.g., JSON, XML, and text) is essential, particularly in projects that deal with semi-structured data. Naturally, having some familiarity with NoSQL databases is also very important, as it goes hand-in-hand with this sort of data.
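As a small taste of what handling semi-structured data looks like, the sketch below reads the same kind of record from JSON and from XML and brings both into one common form. The field names and the sample records are made up for illustration; real data would usually be messier.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical inputs: note that some fields are optional in both formats
JSON_TEXT = ('[{"id": 1, "name": "Ann", "tags": ["vip", "new"]}, '
             '{"id": 2, "name": "Bob"}]')
XML_TEXT = """<customers>
  <customer id="3"><name>Cat</name><tag>vip</tag></customer>
  <customer id="4"><name>Dan</name></customer>
</customers>"""

def records_from_json(text):
    """Turn a JSON array of objects into uniform dictionaries."""
    return [{"id": int(item["id"]),
             "name": item["name"],
             "tags": list(item.get("tags", []))}  # tags may be missing
            for item in json.loads(text)]

def records_from_xml(text):
    """Turn <customer> elements into the same uniform dictionaries."""
    root = ET.fromstring(text)
    return [{"id": int(node.get("id")),
             "name": node.findtext("name"),
             "tags": [tag.text for tag in node.findall("tag")]}
            for node in root.findall("customer")]

records = records_from_json(JSON_TEXT) + records_from_xml(XML_TEXT)
for record in records:
    print(record)
```

The main point is that fields can be absent or nested, so the code has to decide what a missing value means (here, an empty list of tags) before any analysis can start.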
Naturally, all this is the tip of the iceberg when it comes to transitioning into a data scientist from a data analyst. To make sure this transition is solid enough to build a career on, you need to develop other skills and a good understanding of the complex data involved in data science projects. Being able to communicate with other data professionals well and understand them is also very important. Nowadays, you often have to work as part of a data science team, which involves a certain level of specialization. So, having such expertise is significant, at least for certain data scientist positions.
You can learn more about this topic by reading my first book on data science, namely Data Scientist: The Ultimate Guide to Becoming a Data Scientist. This book covers various topics related to the data scientist role and has a whole section dedicated to similar roles. It is also written in an easy-to-follow way, without too much technical jargon, and it has a glossary at the end. Interviews with data scientists of various levels help clarify the role's details and what it is like on a practical level. So, check it out when you have a moment. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.