Not to be confused with graphics (visuals), graphs are a data structure used in data science work. This data analysis approach is quite popular today, yet it's not covered adequately in the data science literature. The fact that it's a fairly advanced topic may have something to do with that. In any case, graphs are a powerful tool as they capture information without worrying about dimensionality, which, although manageable in most cases, can be challenging to overcome in more complex problems. This article will explore graphs, the theory behind them, their applications, and how they are stored.
Let's start by taking a look at graphs and the theory behind them. First of all, a graph is a representation of information in the form of a graphical structure consisting of nodes and arcs. Nodes represent entities of interest (e.g., people in a social media network), and arcs are the connections among these entities (e.g., an online friendship between them). The nodes and the arcs may have additional metadata attached to them, such as attributes (name of that person, date of birth, the duration of each friendship connection, etc.). The attribute that depicts the "strength" or "length" of an arc is referred to as its weight, and it's crucial for various graph algorithms. So, a graph can formally represent the relationships of a set of entities, along with any supplementary information involved. The mathematical framework of graphs and a series of useful theorems and heuristics are referred to as Graph Theory.
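To make the terminology concrete, here is a minimal sketch of a weighted graph in Python, assuming an undirected friendship network stored as a dictionary of dictionaries. The people, their attributes, and the helper functions `add_arc` and `neighbors` are hypothetical, chosen only for illustration.

```python
# Nodes carry attributes; arcs carry a weight (here: friendship duration in years).
# All names and values below are made up for illustration.
nodes = {
    "alice": {"born": 1990},
    "bob": {"born": 1985},
    "carol": {"born": 1992},
}

# Adjacency structure: node -> {neighbor: weight}
arcs = {name: {} for name in nodes}


def add_arc(a, b, weight):
    """Add an undirected arc between a and b with the given weight."""
    arcs[a][b] = weight
    arcs[b][a] = weight


def neighbors(node):
    """Return the neighbors of a node, sorted by descending arc weight."""
    return sorted(arcs[node], key=lambda n: arcs[node][n], reverse=True)


add_arc("alice", "bob", 5.0)
add_arc("alice", "carol", 2.5)

print(neighbors("alice"))  # ['bob', 'carol']
```

A dictionary of dictionaries keeps weight lookups cheap and is easy to read, though a dedicated graph library or a graph database would be the more likely choice for anything beyond a toy example.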
Graphs have lots of useful applications in data science work. When they are used for data analysis, they go by the term Graph Analytics. This kind of analytics is an integral part of data science, although it's not always essential for the average data science project. For example, if you care about analyzing a complicated situation, like the logistics of an organization, graphs may come in very handy. However, if you care about predicting the next quarter's sales, graphs may be overkill, plus the problem would be more easily solved through a time series regression model. Also, for problems involving lots of dimensions, graphs can be handy because they can model the data points' relationships through a similarity metric that they can use as the set of weights of the various arcs in a graph.
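As a rough illustration of the last point, the sketch below turns a handful of multi-dimensional data points into a graph whose arc weights come from a similarity metric. The data points, the particular similarity formula (one over one plus the Euclidean distance), and the threshold are all assumptions made for this example; other metrics would work just as well.

```python
import math

# Hypothetical data points in four dimensions.
points = {
    "a": [1.0, 0.0, 2.0, 1.0],
    "b": [1.0, 0.1, 2.1, 0.9],
    "c": [9.0, 7.0, 0.0, 3.0],
}


def similarity(x, y):
    """Map Euclidean distance to a similarity in (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + math.dist(x, y))


def similarity_graph(data, threshold):
    """Return {(u, v): weight} for every pair whose similarity >= threshold."""
    names = sorted(data)
    graph = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            w = similarity(data[u], data[v])
            if w >= threshold:
                graph[(u, v)] = w
    return graph


graph = similarity_graph(points, threshold=0.5)
print(graph)  # only the pair ("a", "b") is similar enough to get an arc
```

However many dimensions the points have, each relationship is reduced to a single arc weight, which is what makes this representation attractive for high-dimensional data.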
But where do we store these elaborate data structures? Fortunately, there are specialized databases for this task, aka graph databases. This kind of data storage and retrieval system allows for efficient encoding of the graphs and some useful operations with them. Neo4j is a popular such database, as well as a graph visualization tool. However, there are others more refined that are also able to manage enormous datasets. Most modern graph databases today are NoSQL databases that can handle other kinds of datasets, not just graph-based ones. A well-known such database is ArangoDB.
If you are interested in learning more about this and other useful data science methodologies, check out my book Data Science Mindset, Methodologies, and Misconceptions. In this manuscript, I delve into all kinds of methodologies related to data science work, including graphs and NoSQL databases. However, the focus is on the data science mindset, which is essential for any data scientist, as well as for remaining relevant in this field. Check it out when you have a moment. Cheers!
The data scientist role is an incredibly important one in the world today. Be it in for-profit organizations or non-profit ones, it has a lot of value to add and aid decision-making. However, it's still unclear what exactly it entails and how someone can become a data scientist starting from a data analytics background.
The data scientist is a tech professional who processes data, especially complex data, in large amounts (aka big data) to derive insights and build data products. This role involves gathering data, cleaning it up, combining it with other relevant data, evaluating the features involved, and building models based on them, usually to predict some variable of interest or solve some complex problem. It also involves creating insightful visuals and presenting your findings to the project stakeholders, with whom you often need to liaise throughout the data science projects. For all this work, you need to use a lot of programming and various data analysis methods, particularly machine learning.
To transition to the data scientist role from the data analyst one, you need to beef up your programming skills and work on your data analysis methodologies. Learning more techniques for pre-processing data (data engineering) is also essential. What's more, you need to familiarize yourself with various methods for depicting data, such as graphs, and how to process data encoded this way. Dimensionality reduction methods are also vital for assuming the data scientist role, just like various sampling techniques. Furthermore, handling data in different formats (e.g., JSON, XML, and text) is essential, particularly in projects that deal with semi-structured data. Naturally, having some familiarity with NoSQL databases is also very important, as it goes hand-in-hand with this sort of data.
Naturally, all this is the tip of the iceberg when it comes to transitioning into a data scientist from a data analyst. To make sure this transition is solid enough to build a career on top of it, you need to develop other skills and a good understanding of the complex data involved in data science projects. Being able to communicate with other data professionals well and understand them is also very important. Nowadays, you often have to work as part of a data science team, which involves a certain specialization level. So, having such expertise is significant, at least for certain data scientist positions.
You can learn more about this topic by reading my first book on data science, Data Scientist: The Ultimate Guide to Becoming a Data Scientist. This book covers various topics related to the data scientist role and has a whole section dedicated to similar roles. It is also written in an easy-to-follow way, without too much technical jargon, and it has a glossary at the end. Interviews with data scientists of various levels help clarify the role's details and what it looks like on a practical level. So, check it out when you have a moment. Cheers!
Data analytics and data science are both about finding useful (preferably actionable) insights in the data and helping with the decisions involved. In data science, this usually involves predictive models, usually in the form of data products, while the analysis involved is more in-depth and entails more sophisticated methods. But what tools do data analysts and data scientists use?
First of all, we need to examine the essential tasks that these data professionals undertake. For starters, they usually gather data from various sources and organize it into a dataset. This task is usually followed by the task of cleaning it up to some extent. Data cleaning often involves handling missing values and ensuring that the data has some structure to it afterward. Another task involves exploring the dataset and figuring out which variables are the most useful. Of course, this depends on the objective, which is why data exploration often involves some creativity. The most important task, however, is building a model and doing something useful with the data. Finally, creating interesting plots is another task that is useful throughout the data analyst/scientist's workflow.
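As a small illustration of the data cleaning step, here is a sketch that fills in missing values with the column mean. The records, the column names, and the helper `impute_with_mean` are hypothetical, and mean imputation is just one of several possible strategies (dropping rows or using the median are common alternatives).

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 29, "income": None},
    {"age": 41, "income": 58000.0},
]


def impute_with_mean(rows, column):
    """Return a copy of rows with missing values in `column` replaced by the column mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


cleaned = impute_with_mean(records, "income")
cleaned = impute_with_mean(cleaned, "age")
print(cleaned)
```

The function returns new rows rather than modifying the originals, so the raw data stays available for comparison during exploration.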
You can do all the tasks mentioned earlier using a variety of tools. More specifically, for data acquisition tasks, SQL-based software is typically used (e.g., PostgreSQL). This sort of software makes use of the SQL language for accessing and querying structured databases so that the most relevant data is gathered. For data science work, NoSQL databases are also often used, along with their corresponding software. As for the data analytics tasks themselves, data analysts commonly use a program like MS Excel, while data scientists tend to use a programming language like Python or Julia. Data analysts use Tableau or some similar application for data visualization tasks, while data scientists employ a graphics library or two in the programming language they use. For all other tasks (e.g., putting things together), both kinds of data professionals use a programming language like Python or Julia.
Naturally, all of the tools mentioned previously have evolved and continue to evolve. The fundamental functionality may remain about the same, but new features and changes to the existing features are commonplace. For example, new libraries are coming about in programming languages, expanding the usefulness of the languages while creating more auxiliary tools for various data-related tasks. No matter how these tools evolve, the one thing that doesn't change is the mindset behind all the data analytics/science work. This mindset is the driving force of all such work and the ability to use whatever tools you have at your disposal to make something useful for the data at hand. The mindset is closely related to a solid understanding of the methodologies involved in data analytics/science.
For more information on this subject, particularly the mindset part, check out my book Data Science Mindset, Methodologies, and Misconceptions. In it, I cover this topic in sufficient depth, and even though it is geared towards data science professionals, it can also be useful to data analysts and other data professionals. Cheers!
Ensembles are sophisticated machine learning models that consist of other, simpler models. They are quite useful when accuracy is the fundamental requirement and computational resources are not a severe limitation. Ensembles are trendy in all sorts of current projects as they have significant advantages over conventional models. Let's take a closer look at this through this not-too-technical article.
Ensembles are an essential part of modern data science as they are more robust and powerful as models. Additionally, ensembles are ideal in cases of complex problems (something increasingly common in data science), as they can provide better generalization and more stability. In cases where conventional models fail to provide decent results, ensembles tend to work well enough to justify using data science in the problem at hand. That's why data scientists usually use them after all attempts to solve the problem with conventional models have failed.
The most common ensembles are those based on the decision tree model. The main reason is that this kind of model is relatively fast to build and train while yielding reasonably good performance. What's more, it's easy to interpret, something that carries over to the ensemble itself, making it a relatively transparent model. Other ensembles are based on the combination of different models belonging to different families. This heterogeneous architecture enables the ensemble to be more diverse in processing the data and to yield better performance. For all the ensembles out there, there needs to be a heuristic in place to determine how the different outputs of the models that comprise the ensemble are fused. The most straightforward such heuristic is majority voting, which works quite well when all the ensemble's models are similar.
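Majority voting is simple enough to sketch in a few lines. The example below assumes each model has already produced one label per sample; the function name `majority_vote` and the sample predictions are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption of this sketch rather than a standard.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-model label predictions into one label per sample.

    `predictions` is a list with one inner list per model; each inner list
    holds that model's predicted label for every sample, in the same order.
    Ties go to the label encountered first among the models.
    """
    fused = []
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        fused.append(label)
    return fused


# Hypothetical outputs of three models on four samples.
model_outputs = [
    ["spam", "ham", "ham", "spam"],
    ["spam", "spam", "ham", "ham"],
    ["ham", "spam", "ham", "spam"],
]

print(majority_vote(model_outputs))  # ['spam', 'spam', 'ham', 'spam']
```

Using an odd number of models for a two-class problem avoids ties altogether, which is one reason that setup is popular in practice.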
The main drawback of ensembles is a particular compromise you have to make when using them. Namely, the transparency greatly diminishes, and in some cases, it disappears altogether. This phenomenon can be an issue when you need the model you built to be somewhat interpretable. Additionally, ensembles can overfit if the dataset isn't large enough or if they comprise a large number of models. That's why they require special care and are better off being handled by experienced data scientists. Finally, ensembles require more computational resources than most data models. As a result, you need to have a good reason for using them in a large-scale scenario, since they can be quite costly.
You can learn more about ensembles and other machine learning topics in one of my recent books. Namely, in the Julia for Machine Learning book, published last year, I extensively cover ensembles and several other related topics. Using Julia as the language through which the various machine learning models and methods are implemented, I examine how all this can be useful in handling a data science project robustly. Check it out when you have a moment. Cheers!
Questions are an integral part of scientific work, especially when research is concerned. Without questions, we can't have experiments and the capability of testing our observations with our current understanding of the world. Naturally, this applies to all aspects of science, including one of its most modern expressions: data science. This article will explore the value of questions in science, how they relate to hypotheses, and how data science comes into the picture.
Questions are what we ask to figure out how things fit in the puzzle of scientific work. They can be general or specific (the latter being more common), and they are always linked to hypotheses in one way or another. The latter are the formal expressions of questions and usually take a true or false label, based on the tests involved. Tests are related to experimentation, whereby we see how the evidence (data) accumulated fits the assumptions we make. Without questions and hypotheses, we cannot have scientific theories and anything else useful in this paradigm of thought. Note that all this is possible because we are open to being wrong and are genuinely curious to find out what's going on in the area we are investigating.
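To illustrate how a question becomes a testable hypothesis, here is a sketch of a permutation test, which is one of many ways to check whether the data fits an assumption. The question is whether two groups differ in their mean; the hypothesis under test is that they don't. The measurements, the function name `permutation_test`, and its parameters are all made up for this example.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Under the hypothesis that group labels don't matter, shuffling the
    pooled values should produce differences as large as the observed one
    fairly often. A small p-value suggests the hypothesis doesn't hold.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements from two experimental conditions.
control = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
treated = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]

p_value = permutation_test(control, treated)
print(f"p-value: {p_value:.4f}")
```

The true or false label mentioned above then comes from comparing the p-value against a threshold chosen beforehand (0.05 is a common convention), while staying open to the possibility that more data will overturn the verdict.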
Questions and hypotheses are also vital because they are a crucial part of the scientific method, the cornerstone of science. Since its inception a few centuries ago, the scientific method has laid the framework of scientific work, particularly in forming new theories and developing a solid understanding of the world. It's closely tied to experimentation since, at its core, science is empirical and relies on observable data rather than predefined ideas. The latter belong to the realm of philosophy, which still has a role in science even if it's not as prominent as it used to be. The scientific method is precise and relies on a disciplined methodology in scientific work, but it's also open to creativity. After all, not all questions are deterministic and predictable; some of them may stem from a deeper understanding of the world, guided by intuition.
But how does all this relate to data science? First of all, data science is the application of scientific methodologies to problems beyond scientific research. This way, it is a broader application of scientific principles and the scientific method, involving data from all domains. Because of this, it's crucial to think about questions and try to answer them as systematically as we can, using the various data analysis methodologies at our disposal. Not all of these methodologies are as clear-cut and easy as Statistics. Still, all of them involve some models that try to describe or predict what would happen when new data comes into play, something fundamental in scientific work.
If you wish to learn more about this topic and data science's application of the scientific method, check out my latest book Doing Science Using Data Science. In this book, my co-author and I explore various science-related topics, emphasizing the practical application of data science methodologies, describing how these ideas are implemented in practice. The book is accompanied by a series of Jupyter notebooks, using Julia and Python. Check it out when you have a moment. Cheers!
A data product is the main deliverable of data science and some data analytics projects. It involves developing a stand-alone piece of software, often with a data model under the hood. Other times, it takes the form of a set of visualizations that depict particular variables of interest or other useful insights. In any case, data products are vital as they constitute an essential part of a data science project and a useful deliverable in a data analytics project (even if it's not always a requirement).
Dashboards are a kind of data product, featuring graphics and an intuitive (albeit minimalist) interface. They sometimes involve some control element that enables the user to change some settings and adjust the related graphics to different operating conditions. This element provides a more dynamic aspect to the dashboard, which augments the innate dynamism they have. The latter stems from the fact that they are usually linked to a dataset that changes over time, as new data becomes available.
The popularity of dashboards illustrates data visualization's value, be it in data science or data analytics. It's hard to imagine a project like this without some visuals, pinpointing important insights and other findings. Additionally, whenever predictive models are involved, specialized visuals for showcasing the models' performance are a must. That's why data visualization as a sub-field of data science and data analytics has grown, especially in the past few years. The development of professional software undertaking such tasks and specialized libraries in various programming languages have contributed to this growth.
Beyond data visualization, however, other subtle aspects of the data science and data analytics fields are essential but less pronounced in the educational material out there. For example, the communication of insights and using the visuals mentioned earlier in presentations is something every data professional ought to know. This point is particularly important when you need to liaise with non-technical people, whether colleagues or clients. Also, managing a data analytics project can be challenging, especially in the modern Agile-driven workplace. After all, most data analytics projects today are all about teamwork, tight deadlines, and changing requirements. What's more, although a dashboard is a powerful asset in an organization, it needs to be maintained periodically and fed good-quality data. The latter requires additional work and proper data governance, which, unfortunately, not everyone involved in this field is aware of.
My Data Scientist Bedside Manner book, which I co-authored last year, is an excellent resource for this kind of topic. Although written for data science professionals mainly, it can be useful to all sorts of data analysts and people involved in data-driven projects (e.g., managers). The idea is to bridge the gap between technical and non-technical professionals in an organization and leverage data analytics work effectively. This is an excellent reference book that every data professional can benefit from in the years to come. Cheers!
Programming, particularly in languages like Julia, Python, and Scala, is fundamental in data science. It enables all kinds of processes, such as data engineering (particularly ETL tasks), data modeling, and even data visualization, to name a few. If you know what you are doing, you can also solve practical problems through programming (e.g., optimization tasks) by modeling them appropriately. It's a versatile tool with a lot of potential, especially once you get used to it and see it as an extension of your mind. However, it's not as simple as putting bits and pieces of programming code together. This article will examine the various strategies for coding concerning the various objectives we may have.
Let's start with the most intuitive kind of objective, namely getting things done as quickly as possible. This strategy may be suitable for solving problems that need a solution once, so the code doesn't need to be revised or reused. This sort of strategy involves writing code that works to prove a particular concept before solving the task more seriously. Efficiency isn't pursued, nor are readability and the use of lots of comments to explain what's happening. This strategy is typical for solving a drill or a relatively simple problem that you don't need to present to the project stakeholders. As a result, using this strategy for other scenarios is a terrible idea.
Another objective we may have when writing a script is efficiency. When we process lots of data, we don't want to have lazy code that takes a while to finish the task at hand. So, optimizing the code for efficiency through smart memory allocations, static typing, using the appropriate variable types, etc. can help with that. This programming strategy is quite a common one that can save us a lot of time. It's also useful when deploying this code at scale, since it means that we'll be using fewer computational resources (CPU/GPU power and memory), lowering the cost of the project at hand.
Interpretability and maintainability are a different objective altogether, tied to the final programming strategy. So, if you want your program to be easy to read and understand, making it easy to update when necessary, you opt for this strategy. It involves organizing your code to break the problem down into simple tasks handled in different classes and functions, including lots of comments explaining your reasoning and what different functions do. Naming the variables in an intuitive way is also a big plus, even if that makes the code longer at times. In any case, such code is built to last since it's easy to maintain and helps newcomers that view it adopt good practices when writing their own code.
Naturally, you can use a combination of the above strategies for your project. Not all of them play ball together, of course, but you can still make a script that's efficient and easy to understand/maintain. So, unless pressed for time, it's good to have such an approach to your programming, adapting it to each project's requirements.
If you wish to learn more about programming and how it applies to data science, you can check out one of my latest books, Julia for Machine Learning. This book explores how the Julia programming language can be used to tackle various data science problems, using machine learning models and heuristics. Accompanied by a series of examples in Jupyter notebooks and script files, it illustrates in quite comprehensible code how you can implement this framework for your data science work. So, check it out when you have a moment. Cheers!
Machine Learning is the field involved in using various algorithms that enable a machine (typically a computer) to learn from the data available, without making any assumptions about it. It includes multiple models, some simpler, others more advanced, that go beyond the statistical analysis of the data. Most of these models are black-boxes, though a few exhibit some interpretability. Yet, despite how well-defined this field is, several misconceptions about it conceal it in a veil of mystique.
First of all, machine learning is not the same as artificial intelligence (A.I.). There is an overlap, no doubt, but they are distinct fields. You can spend your whole life working in machine learning without ever using A.I. and vice versa. The overlap between the two takes the form of deep learning, the use of sophisticated artificial neural networks that are leveraged for machine learning tasks. Computer Vision is an area of application related to the overlap between machine learning and A.I.
What’s more, machine learning is not an extension of Statistics. Contrary to what many Stats fans say, machine learning is an entirely different field distinct from Statistics. There are similarities, of course, but they have fundamental differences. One of the key ones is that machine learning is data-driven, i.e., it doesn't use any mathematical model to describe the data at hand, while Statistics does just that. It's hard to imagine Statistical models without a data distribution or some function describing the mapping, while machine learning models can be heuristics-based instead.
Nevertheless, machine learning is not purely heuristics-based and, therefore, void of theoretical foundations. Even if it doesn't have the 200-year-old body of theory that Statistics has, machine learning has some theoretical standing based on the few decades of research behind it. Many of its methods rely on heuristics that "just work," but it's not what people consider alchemy. Machine learning is a respectable scientific field with lots to offer both to the practitioner and the researcher.
Beyond the misconceptions mentioned earlier, there are additional ones that are worth considering. For example, machine learning is not plug-and-play, as some people think, no matter how intuitive the corresponding libraries are. What's more, machine learning is not always the best option for the problem at hand, since some projects are okay with something simple that's easy to understand and interpret. In cases like that, a statistical model would do just fine.
It's hard to do this topic justice in a single blog post, but hopefully, this has given you an idea of what machine learning is and what it isn't. I talk more about this subject in one of my most recent books, Julia for Machine Learning. Additionally, I plan to cover this topic in some depth in a 90-minute talk at the next Data Modeling Zone conference in Belgium this April. I hope to see you there! Cheers.
In a nutshell, Quantum Computing is the computing paradigm that uses quantum properties in computer systems, such as superposition and quantum tunneling. Experts consider quantum computing quite advanced and the state-of-the-art of computing today, even if the specialized hardware it uses makes it a bit of a niche case study. Despite its numerous merits, quantum computing is not a panacea, even though it is considered relevant in our field, particularly in A.I. This article will explore this relationship, where Q.C. is right now, and where you can access it.
Quantum computers' performance is usually measured in qubits (quantum bits) instead of traditional bits. Each qubit is a quantum particle in superposition and corresponds to the most rudimentary piece of data a quantum computer can handle. Qubits are not easy to maintain, and when they work in tandem, it's quite probable for the superposition to collapse unexpectedly, resulting in errors in the computations involved. So, having a certain number of qubits (the larger, the better) in a computer is quite an accomplishment. Larger numbers enable quantum computer users to tackle more challenging problems, potentially adding more value to the project at hand.
Right now, quantum computing is at a stage where the number of qubits a machine can handle is in the two digits. For example, IBM's quantum machine boasts 65 qubits, although the company has plans for much larger numbers soon (they expect to have a quantum machine with 1000+ qubits by 2023). However, it's important to note that each company uses a somewhat different approach, meaning that the qubits in the computers they produce are not directly comparable to each other.
Now, what about the potential disruption in data science and A.I. work due to quantum computing? Well, since our field often involves lots of heavy computations, some of them around NP-hard problems in combinatorics (e.g., selecting the optimal set of features from a feature set), it can surely gain from quantum computers. That's not to say, however, that every data science or A.I. project can experience a boost from the Q.C. world. Simpler models and standard ETL work are bound to remain the same, while using a quantum machine for them would waste these pricey computational resources. So, it's more likely that a combination of traditional computing and quantum computing will be normal once quantum machines become more commonplace. Additionally, for optimization-related problems, particularly those involving many variables, quantum computing may have a lot to offer. Still, whether it's worth the price is something that needs to be determined on a case-by-case basis.
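To see why feature selection gets expensive on conventional hardware, consider the brute-force approach sketched below. It runs on an ordinary computer and involves no quantum code; it only shows how quickly the search space grows. The feature names, their usefulness values, and the scoring function are made up for illustration.

```python
from itertools import combinations


def count_feature_subsets(n_features):
    """Number of non-empty subsets of n features: 2**n - 1."""
    return 2 ** n_features - 1


def best_subset(features, score):
    """Exhaustively score every non-empty subset and return the best one."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Toy scoring function (an assumption): each feature has a made-up usefulness,
# and every selected feature costs a fixed penalty.
usefulness = {"age": 0.30, "income": 0.25, "zip": 0.02, "clicks": 0.20}


def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


best, best_score = best_subset(list(usefulness), toy_score)
print(best)  # ('age', 'income', 'clicks')

for n in (10, 20, 40):
    print(n, "features ->", count_feature_subsets(n), "subsets")
```

With four features there are only 15 subsets to check, but at 40 features the count exceeds a trillion, which is why exhaustive search stops being an option and heuristics (or, potentially, quantum approaches) come into the picture.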
Let’s now look at the various quantum computing vendors out there. For starters, we have Amazon with its AWS Quantum Computing center at Caltech. Microsoft is also a significant player, with its Azure Quantum service, utilizing a specialized language (Q#). IBM is a key player, too, along with D-Wave Systems, the two being the first to develop this technology. Google Research is Alphabet's division for this tech and is now also a player in this area. What's more, there are hardware companies too in this game, such as Intel, Toshiba, and H.P. Naturally, all the companies that have developed their Q.C. product enough to make it available do so via a cloud, since it's much more practical this way. For those who like the cloud but don't have the budget or the project that lends itself to quantum machines, the Hostkey cloud provider relies on conventional computers, including some with GPUs onboard.
You can learn more about this and other topics relevant to A.I. and data science through my book A.I. for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond. In this book, my co-author and I cover various aspects of data science work related to A.I., as well as A.I.-specific topics, such as optimization. What’s more, the book has a hands-on approach to this subject, with lots of code in both Python and Julia. So, check it out when you can. Cheers!
Like any project in an organization, a data science project needs to have boundaries regarding how many resources are allocated to it. This resource usage translates into a monetary cost that takes the form of a budget. So, even if many professionals in this field are not aware of it, budgeting plays an essential role in every data science project and provides the framework through which it can manifest.
Despite their similarities to other projects, data science projects differ in many ways. First of all, their return isn't clear or even guaranteed. A data science team may investigate a dataset for insights or predictive potential, but it may not dig up something worthwhile. The data in a company's databases may be useful for its day-to-day tasks but useless for anything data science-related. It's next to impossible to know if there is anything worthwhile beforehand, so when starting a project, you have to take significant risks. As for the project's time frame, that's also highly uncertain, especially if it's a new project. This uncertainty can veer the project off-course, and the risk of going over-budget is substantial.
When creating a budget for a data science project, several factors are considered to mitigate the risk of failure. First of all, you need to have a clear plan of what you expect the data science team to find. If possible, you should also have some ideas as to how you could translate these findings into a value-add, be it through a revenue stream, some improvement in the customer/user experience, or some enhancement in the organization's workflow efficiency.
What’s more, it's good to examine a data science project from various perspectives and ensure that the data scientists involved have peace of mind when working on it. It's not just up to them to make it work, since the other stakeholders have a responsibility in it too. For example, the data owners need to do their part and ensure that the data science team receives all the data it needs promptly. The developers involved need to have sufficient bandwidth to help with any ETL and other project processes. Finally, the business people need to have realistic expectations of what the data science team can deliver and how to leverage its work.
Beyond the above factors that you need to consider, some additional considerations are useful to have when creating a budget. For instance, the cost of cloud computing involved is something that can get out of hand quickly, especially if you are not used to working at a particular scale of data. Sometimes it's more effective to have dedicated servers available to you instead of leasing computing power in a virtual machine. Also, it would make sense to start with a proof-of-concept project to gain a better understanding of the problem at hand before going at it with full force.
You can learn more about this and the other less technical aspects of data science work through one of the books I co-authored relatively recently. Namely, the Data Scientist Bedside Manner book delves into this topic and explores data science from a perspective few people consider. Using information from various sources, including some experienced professionals in the field, it provides guidance both for the data-driven manager and the data science professional, bridging the gap between the two. Check it out when you have the chance. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.