Introduction and Survey
Welcome to the data literacy part of Foxy Data Science! This is a great starting point in your data literacy journey, something we can all improve and benefit from.
While I'm developing this section with relevant content for you, feel free to fill in the survey below, which I've put together to help me better understand where you stand and how I can help you in your data literacy endeavors (cultivating a data culture in your team/org and fostering a more data-centric approach to decision-making). BTW, if you are on a mobile device and the survey below doesn't render well, you can try this link instead. Cheers!
Some Definitions so that we are all on the same page regarding Data Literacy
Fields and Domains
Data Literacy is the ability to read, write, and communicate data with other stakeholders, keeping the communication in context. It involves an understanding of data sources/assets and constructs, analytics-related methods and techniques, and the ability to describe the business value or outcome of such applications.
Data Governance involves organizing, managing, and monitoring the integrity and security of data in an enterprise’s system. Its business requirement involves establishing policies and frameworks to facilitate these processes, ensuring that any new or existing data complies with current internal and external regulatory standards (e.g., GDPR, PECR, HIPAA, etc.).
Machine learning (ML) is a methodology that leverages specific algorithms to analyze datasets and predict outcomes in a data-driven fashion. These algorithms are known as machine learning algorithms and span several use cases. ML is often linked to AI, though there are ML algorithms that are simpler and easier to interpret than AI-based ones.
A data democracy describes a methodological framework of values and actions that benefit the public or the typical user, minimizing any harm related to them. Organizations like Data for Democracy, initiated by Bloomberg and BrightHive, and projects like Data for Democracy, established by the University of Washington to help Myanmar transition to a data democracy, are spearheading this framework.
Artificial Intelligence (AI) is a field of computer science dealing with the emulation of human intelligence using computer systems and its applications in a variety of domains. The application of AI to data science is noteworthy and an important factor in the field, especially since the 2000s. AI comes in various shapes and forms, and it’s closely related to machine learning these days.
Data analytics is a general term to describe the field involving data analysis as its main component. Data analytics is more general than data science, although many people often use the two terms interchangeably.
Data science is the interdisciplinary field undertaking data analytics work on all kinds of data, with a focus on big data, for the purpose of mining insights and/or building data products. Data science includes machine learning as well as other data analytics frameworks.
Data ethics is an emerging part of data fields related to the ethical concerns around modern data technologies, especially AI. Data ethics also involves privacy-related matters, the use of sensitive data in data products, etc.
Data mining is the process of finding patterns in data, usually in an automated way. Data mining is a data exploration methodology, and it is often seen as the precursor of data science.
Roles of Various Data Professionals
A Chief Data Officer (CDO) is a C-level executive who helps bridge the gap between data technologies and business. This person evangelizes an enterprise-wide Data Management strategy at a senior level. The CDO leads Data Management initiatives, enabling an organization to leverage its data assets and gain a competitive advantage from them. A CDO tends to be part business strategist, adviser, and leader of project managers involved in various data-related initiatives in an organization.
Data analysts are information workers who help businesses make better, more informed decisions by not only collecting and investigating data, but also translating the relevant information into useful observations for their customers. The latter includes technical staff, business teams, and leadership. In many organizations, data analyst and data scientist roles share similarities, but data analysts focus on speedily interpreting data, performing exploratory data analysis, creating dashboards, and working with structured data.
A data architect is a data professional who provides clear specifications, models, and definitions, translating a business’ Data Strategy into a Data Architecture and implementing this structure to align with an organization’s Data Governance. An architect is one who designs and advises on the construction of a data entity, such as a database.
Data engineers are information workers geared towards building Data Architecture through infrastructures and foundations. A data engineer is tasked with designing and maintaining the architecture of data systems, which incorporates concepts ranging from analytic frameworks to data warehouses. Responsibilities also include configuring, managing, and scaling data pipelines. Data engineers tend to have a programming background (e.g., Java, Scala, or Python) and focus on low-level work, such as ETL.
A data modeler is a person who models data or documents software and business system designs. By doing this, a data modeler translates business needs into technical specifications for other data professionals to work with. Developers and other IT members benefit from this when creating new data systems and troubleshooting and maintaining them. Data models promote consensus among developers, customers, and other stakeholders.
Data scientists are data professionals who emphasize rigor and performance when obtaining, scrubbing, exploring, modeling, and interpreting data. Data scientists bring a different context to their work than data analysts do, through high-powered math and more in-depth analysis of the data (e.g., predictive modeling). Josh Wills, a software engineer, once described a data scientist as a “person who is better at statistics than any software engineer and better at software development than a statistician.” However, in practice there are other skills involved in this role, such as business acumen.
A database administrator (DBA) is a person who manages, maintains, and secures data in one or more data systems so that a user can perform analysis for business operations. DBAs take care of data storage, organization, presentation, utilization, and analysis from a technical perspective.
Other terms
Data involves anything potentially containing information. Data is the prima materia of data science and data analytics in general, taking various forms. All data is eventually transformed into numbers, as it’s easier to analyze it this way.
An algorithm is a step-by-step procedure for calculations and logical operations. In a data science setting, algorithms can be designed to facilitate data science and acquire knowledge by themselves, rather than relying on hard-coded rules. Modern algorithms in data science often rely on heuristics for their evaluations and decisions.
Data regulations are policies and laws ensuring that processed data is shared or governed properly. This involves the right data assets going to the right place at the right time. Data regulations are part of Data Governance.
A microservice is a flexible, single-purpose software program that is essentially compatible with any application. Microservices are like the various appliances in a household: each one is more or less independent of the others, but together they can offer a unique user experience and solve a problem that none of them could solve on its own.
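To make this a bit more concrete, here is a minimal sketch of a single-purpose microservice, assuming Python with Flask as the framework; the temperature-conversion endpoint is purely hypothetical and just stands in for whatever one narrow job a real service would do:

```python
# A minimal sketch of a single-purpose microservice (Flask assumed).
# The /convert endpoint and its logic are hypothetical examples.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/convert")
def convert():
    # This service does exactly one thing: convert Celsius to Fahrenheit.
    celsius = float(request.args.get("celsius", 0))
    return jsonify({"fahrenheit": celsius * 9 / 5 + 32})

if __name__ == "__main__":
    app.run(port=5000)  # other services would call this over HTTP
```

Other applications don't need to know how this service works internally; they only need its small HTTP interface, which is what makes microservices so composable.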
A business glossary differs from a data dictionary in that its focal point, Data Governance, goes beyond a data warehouse or database. A business glossary is a means of sharing the internal vocabulary used within an organization. Most business glossaries share certain characteristics such as standard data definitions and documentation of them. Business glossaries are necessary for an efficient partnership between data professionals and the business stakeholders.
A chatbot is an AI system based on natural language processing (NLP). It aims to provide human-friendly responses to questions and statements typed in a human language. Chatbots are used for automating customer service processes, particularly on websites, as well as entertainment. Lately, they have been used for accessing knowledge bases and other information repositories.
Data lineage is a kind of data life cycle that includes the data’s origins and where it moves over time. It can also describe what happens to data as it goes through diverse processes. Data lineage can help with efforts to analyze how information is used and to track key bits of information that serve a particular purpose.
Documentation involves any relevant material (usually text) that accompanies some algorithm or metric. This may include examples of its usage, limitations, and known bugs in the current implementation of it.
Exploratory Data Analysis (EDA) is part of the data science pipeline that involves exploring the data at hand and understanding the variables involved. EDA is essential as it provides valuable insights about the dataset and helps drive the development of data models. If done properly, EDA heavily relies on heuristics.
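For a concrete flavor of EDA, here is a minimal sketch assuming Python with pandas; the file and column names are placeholders rather than anything from a real project:

```python
# A minimal EDA sketch using pandas; "customers.csv" and "segment" are placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")      # load the dataset
print(df.shape)                        # number of rows and columns
print(df.dtypes)                       # variable types
print(df.describe(include="all"))      # summary statistics for every column
print(df.isna().sum())                 # missing values per column
print(df["segment"].value_counts())    # distribution of a categorical variable
```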
Generalization is a key characteristic of a data science model, where the system is able to handle data beyond its training set reliably. A proxy to good generalization is similar performance between the training set and a testing set, as well as consistency among different training-testing set partitions of the whole dataset.
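A common way to gauge generalization is to compare performance on the training set with performance on a held-out testing set. Here is a minimal sketch, assuming scikit-learn:

```python
# A sketch of checking generalization via a train/test split (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))  # similar scores suggest good generalization
```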
A heuristic is an empirical metric, algorithm, or function that aims to provide some useful tool or insight, facilitating a data science or artificial intelligence method or project. Heuristics are entirely data-driven and focus on performing a very specific task in an efficient and scalable manner.
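As a toy example, the well-known sqrt(n/2) rule of thumb suggests a number of clusters based purely on the size of a dataset; it is an empirical shortcut, not a guaranteed optimum:

```python
# A toy heuristic: the common sqrt(n/2) rule of thumb for picking a cluster count.
import math

def suggest_k(n_points: int) -> int:
    """Suggest a number of clusters for n_points using a simple empirical rule."""
    return max(1, round(math.sqrt(n_points / 2)))

print(suggest_k(200))  # -> 10
```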
Information is a general term usually linked to any useful signal in a piece of data that can be communicated by other means. Information can also be seen as distilled data; it is more high-level than data and closer to our understanding. From a technical standpoint, information is the useful aspect of a signal or transmission, something studied in information theory, developed by Claude Shannon.
Interpretability is the ability to more thoroughly understand a data model’s outputs and derive how they relate to its inputs (features). Lack of interpretability is an issue for deep learning systems as well as many machine learning systems in general. We often use interpretability and transparency interchangeably.
Noise is all the parts of the dataset that don’t add any value to data science work due to their random nature. Noise is usually handled in the data engineering phase and is in contrast with the signal of the dataset.
Pipeline (also known as workflow) is a conceptual process involving a variety of steps, each one of which can consist of several other processes. A pipeline is essential for organizing the tasks needed to perform any complex procedure (often non-linear) and is very applicable in data science (this application is known as the data science pipeline).
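As an illustration, scikit-learn's Pipeline object expresses this idea directly in code; the two steps shown here (scaling followed by logistic regression) are just an example of chaining a preprocessing step with a modeling step:

```python
# A sketch of a pipeline expressed with scikit-learn's Pipeline object.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),                  # preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),   # modeling step
])
pipe.fit(X, y)          # runs every step in order
print(pipe.score(X, y))
```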
Sampling is the process of acquiring a sample of a population using a specialized technique. It is very important that sampling is done properly, to ensure that the resulting sample is representative of the population studied. Sampling needs to be unbiased, something usually accomplished by making it random. Nevertheless, there are ways to perform sampling deterministically, with certain heuristics.
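Here is a minimal sketch of simple random sampling, assuming pandas; the file name is a placeholder:

```python
# A sketch of simple random sampling with pandas; "population.csv" is a placeholder.
import pandas as pd

population = pd.read_csv("population.csv")
sample = population.sample(n=500, random_state=42)  # unbiased random sample of 500 rows
print(sample.mean(numeric_only=True))               # compare against population.mean() as a sanity check
```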
Sensitivity Analysis is the process of establishing how stable a result is or how prone a model’s performance is to change, if the initial data is different. It involves several methods, such as re-sampling, “what if” questions, etc.
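One simple way to do this is to re-sample (bootstrap) the data several times and observe how much a model's score varies; a small spread suggests a stable result. A minimal sketch, assuming scikit-learn:

```python
# A sketch of sensitivity analysis via bootstrap re-sampling (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

X, y = load_diabetes(return_X_y=True)
scores = []
for seed in range(20):
    Xb, yb = resample(X, y, random_state=seed)                # bootstrap sample of the data
    scores.append(LinearRegression().fit(Xb, yb).score(X, y)) # score on the original data

print("mean R^2:", np.mean(scores), "std:", np.std(scores))   # a small std suggests a stable result
```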
Signal is the underlying gist of a dataset, depicting the information within the data at hand. The signal is not easily measurable, and it is often contrasted with the noise of the data.
A data catalog is a metadata inventory that centralizes access to all of an organization’s available data assets. This repository facilitates dataset search and retrieval enabling users and systems to find the information needed for business easily. A data catalog differs from a data dictionary in its ability to search and retrieve information.
A data container is a transportation solution for a database required to run from one computer system to another. In essence, it is a data structure that “stores and organizes virtual objects (a virtual object is a self-contained entity that consists of both data and procedures to manipulate the data).”
A data dictionary is a description of data in business terms. It also entails information about the data such as data types, details of structure, privacy and security restrictions, etc. Unlike business glossaries, which focus on data across the organization, data dictionaries support data warehouses by defining how to use them.
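As a toy illustration, a single data dictionary entry might look something like the following; all names and values here are hypothetical:

```python
# A toy data dictionary entry for one column of a warehouse table (all values hypothetical).
data_dictionary = {
    "customer_age": {
        "type": "integer",
        "description": "Customer age in years at account creation",
        "nullable": False,
        "sensitivity": "low",
        "source_table": "dw.customers",
    }
}
```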
A data lake is an environment where a vast amount of data, of various types and structures, can be ingested, stored, assessed, and analyzed. Data lakes serve many purposes, including providing an environment for data scientists to mine and analyze data. A data lake is essentially a central storage area for raw data, with minimal (if any) transformation, used by data engineers and data scientists.
A data mart is a subset of a data warehouse designed to service a specific business line or purpose. Data warehousing pioneer Ralph Kimball conceived of data marts to “begin with the most important business aspects or departments.”
Data silos describe isolated data islands that appear or are discovered upon finding disjointed Data Management components. These include systems that cannot programmatically work with other systems because of older or incompatible code, as well as fixed data that is controlled by one department or team but cut off from the rest. Data silos often carry a negative connotation.
A data warehouse is an implementation used to provide decision-support data and aid workers engaged in reporting, query, and analysis. This architectural technology enables organizations to integrate data from a range of sources into common data models. Data warehouses provide insight into operational processes and open new possibilities to leverage data towards making decisions and providing value in general.
A data lakehouse is a data management architecture that combines the benefits of a traditional data warehouse and a data lake. It aims to merge the data warehouses’ ease of access and support for enterprise analytics capabilities with the data lakes’ flexibility and relatively low cost.
A digital transformation strategy is a detailed plan to improve and innovate existing business processes using digital technologies. Organizations that refine their operations by leveraging automation can better support customers and allow employees to focus on activities requiring a more hands-on approach.
A digital twin is a virtual device that contains the exact state, information, and organization of the physical device to which it is connected. You can think of it as a test version before proceeding with product development or interacting with a real-world device. Digital twins go hand-in-hand with implementing the Internet of Things (IoT).
A knowledge graph is a kind of ontology depicting “knowledge in terms of entities and their relationships” (according to GitHub). Knowledge graphs developed from the need to do something with, or act upon, information based on context.
A metadata repository is a software tool that stores descriptive information about the data model used to store and share metadata (information about the data at hand). Metadata repositories combine diagrams and text, enabling metadata integration and change. The metadata repository’s power lies with the easily accessible way people can view and navigate its contents.
A black box is a predictive analytics model or process that is not in any way transparent or comprehensible in how it arrives at its predictions. Many machine learning models are black boxes, in contrast to statistical models, which are generally transparent.
Classification is a very popular data science methodology, under the predictive analytics umbrella. Classification aims to solve the problem of assigning a label (aka class) to a data point, based on pre-existing knowledge of categorized data, available in the training set.
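A minimal classification sketch, assuming scikit-learn and using k-nearest neighbors as the example algorithm:

```python
# A minimal classification sketch (scikit-learn assumed, k-nearest neighbors as the example).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # learn from labeled training data
print(clf.predict(X_test[:5]))     # assign a class label to new data points
print(clf.score(X_test, y_test))   # accuracy on unseen data
```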
Clustering is a data science methodology involving finding groups in a given dataset, usually using the distances among the data points as a similarity metric. Clustering is an unsupervised learning methodology. As a problem, clustering is NP-hard, and it’s frequently tackled with heuristics and metaheuristics.
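A minimal clustering sketch, assuming scikit-learn and using k-means, one of the common heuristics for this problem:

```python
# A minimal clustering sketch using k-means (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment for the first ten points
```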
Regression is a very popular data science methodology, under the predictive analytics umbrella. Regression aims to solve the problem of predicting the values of a continuous variable corresponding to a set of inputs, based on pre-existing knowledge of similar data, available in the training set.
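A minimal regression sketch, assuming scikit-learn and ordinary least squares as the example method:

```python
# A minimal regression sketch: predicting a continuous target (scikit-learn assumed).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print(reg.predict(X_test[:5]))    # predicted continuous values
print(reg.score(X_test, y_test))  # R^2 on unseen data
```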
A data model is a data science module that processes and/or predicts some piece of information using existing data, after the latter has been preprocessed and made ready for this task. Data models add value and are composed of non-trivial procedures. In AI, data models are usually sophisticated systems making use of several data-driven processes under the hood.
Robotic process automation (RPA) is a method or technique enabling the integration of software or the automation of work processes, through the use of the same user interface. RPA is often used in combination with somewhat “clever” scripts, but it doesn’t qualify as AI, at least for the majority of use cases. Also, RPA has nothing to do with Machine Learning.
Cloud (computing) is a paradigm that enables easy, on-demand access to a network of shareable computing resources that can be configured and customized to the application at hand. The cloud is a very popular resource in large-scale data analytics and a common resource for data science applications.
ETL (Extract, Transform, and Load) is a process found in all data-related pipelines, having to do with pulling data out of the source systems (usually databases) and placing it into a data warehouse or a data governance system. ETL is an important part of data acquisition, preceding any data modeling efforts. ETL falls under the data engineering umbrella at a very low level, and as a task it is undertaken by a specialist who usually has a strong programming background.
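Here is a toy ETL sketch, assuming pandas and SQLite; the file, table, and column names are placeholders:

```python
# A toy ETL sketch with pandas and SQLite; file, table, and column names are placeholders.
import sqlite3
import pandas as pd

# Extract: pull raw data out of the source system
raw = pd.read_csv("orders_export.csv")

# Transform: clean and reshape it for analysis
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = raw.groupby(raw["order_date"].dt.date)["amount"].sum().reset_index(name="revenue")

# Load: place the result into the warehouse (here, a local SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```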
GDPR is an EU regulation related to the acquisition of private data (mostly PII) and the need for permission from customers regarding the use of that data. Even though GDPR applies to EU countries only, several other countries have adopted its policies, partly because you still need to abide by them if you want to do business with people living in the EU.