Confused about all the data, AI, and ML terms? And still wondering what MLOps, DevOps, and SRE are about? This glossary will help you! If you want additional information, feel free to reach out.
An AI system that can understand, learn, and apply knowledge across diverse domains, perform tasks with human-level competence, adapt to new situations, and exhibit reasoning, problem-solving, creativity, and social intelligence. AGI aims to surpass the limitations of narrow AI, which is designed for specific tasks or domains and lacks the ability to generalize knowledge or transfer skills to new situations.
The field of computer science focused on creating systems and algorithms that can perform tasks that typically require human intelligence. AI systems can process large amounts of data, learn from experiences, and adapt to new situations. AI includes a variety of subfields, such as machine learning, natural language processing, and computer vision.
AI for IT operations, including ML for DevOps. This term was originally coined by Gartner and refers to the use of AI tools and technologies to improve IT operations. This means collecting, aggregating, and analyzing the vast amount of data generated by IT components to provide value. It includes topics such as anomaly detection, root cause analysis, intelligent monitoring, and predictive maintenance.
The approach of automating some or all stages of the ML development process from raw data to deployment of a fully trained and validated model. It can help non-experts to efficiently make use of machine learning techniques, and can also support experts to improve their productivity - for example, for prototyping models.
An AI language model developed by OpenAI that can understand and generate text in a conversational context. It can engage in back-and-forth exchanges, provide information, answer questions, offer suggestions, and assist with various tasks.
The combination of Continuous Integration (CI) and Continuous Delivery (CD). Sometimes CD can also refer to Continuous Deployment.
The approach of automating the delivery of software changes. It builds on top of Continuous Integration and additionally includes deployment to development environments, integration tests, load tests, and other tests. The goal is to always be able to deploy a new version of the software. The final deployment to production still requires (manual) approval.
The approach of automating the deployment of software changes to production environments. It builds on top of Continuous Integration and Continuous Delivery, but does not require manual approvals for the final release of the new software version.
The approach of automating the integration of software changes, including building and (unit) testing. It includes the process of continuously merging changes from multiple contributors into a single shared codebase.
The practice of continuously collecting, processing, and analyzing data from software systems, infrastructure, and machine learning models in real time to ensure their performance, reliability, and security. Continuous monitoring allows for early detection of issues, proactive maintenance, and informed decision-making to optimize the system's performance and resources.
The approach of continuously updating and improving machine learning models by regularly retraining them with new data. This process helps to maintain the accuracy and relevance of the models, allowing them to adapt to changing patterns in the data and address concept drift or data drift over time. Continuous training often involves automated pipelines for data ingestion, feature engineering, model training, validation, and deployment.
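As an illustration, the following minimal sketch (using scikit-learn; the synthetic dataset, quality threshold, and promotion logic are illustrative assumptions, not part of any specific pipeline) shows the core retrain-and-validate step such an automated pipeline would run on fresh data:

```python
# Minimal sketch of one retraining step: fit on fresh data, validate, and only
# "promote" the new model if it clears an (illustrative) quality threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def retrain(X, y, accuracy_threshold=0.8):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    # In a real pipeline, promotion would mean registering and deploying the model.
    return model if score >= accuracy_threshold else None

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)  # stand-in for newly ingested data
new_model = retrain(X, y)
```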
A practice which ensures the accuracy and quality of development and operations by detecting and eliminating errors. Automated tests are performed at every stage of the development process. By implementing a continuous verification strategy, organizations can ensure that their data, models, and code are verified in real time, reducing the risk of incorrect decisions and outcomes.
An emerging science that studies techniques to improve datasets, which is often the best way to improve performance in practical ML applications. While good data scientists have long practiced this manually via ad hoc trial and error and intuition, DCAI considers the improvement of data as a systematic engineering discipline.
The process of designing, constructing, and maintaining the architecture and infrastructure necessary to collect, store, process, and analyze large datasets. Data engineers work on tasks such as data ingestion, data transformation, and data storage, ensuring that data is clean, reliable, and accessible for data scientists and other stakeholders.
A technology layer and data curation service which integrates data from the underlying data layer(s), such as data lakes, data warehouses, or databases, into a unified and holistic view of the data.
The process of filling in missing or incomplete data values within a dataset. Data imputation aims to minimize the impact of missing data on analysis and machine learning models by using statistical techniques, such as mean imputation, median imputation, or more advanced methods like k-nearest neighbors or model-based imputation.
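For example, a minimal sketch with scikit-learn (the toy matrix is purely illustrative) comparing mean imputation with k-nearest-neighbors imputation:

```python
# Fill missing values (np.nan) with the column mean vs. a k-nearest-neighbors estimate.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```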
A centralized repository used to store data, often containing both structured and unstructured data in raw formats without enforced schemas. Care must be taken to ensure it does not evolve into a data swamp.
An approach combining data architecture with a data operating model to enable sharing, accessing, and managing data products. Important principles also include domain ownership, data as a product, self-service data platforms, and federated computational governance.
A set of practices and principles which help organizations improve the speed, quality, and reliability of their data analytics initiatives. The main components include Continuous Integration, Continuous Monitoring, and Continuous Verification.
A series of data processing steps that transform raw data into a format suitable for analysis, machine learning, or visualization. Data pipelines often involve multiple stages, such as data ingestion, data cleansing, data transformation, feature engineering, and data storage. They can be designed and managed using tools and frameworks that support automation, scalability, and fault tolerance.
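A minimal sketch of such a pipeline, written as composed Python functions (the inline records, column names, and output path are illustrative assumptions):

```python
# Toy linear pipeline: ingest -> clean -> feature engineering -> store.
import numpy as np
import pandas as pd

def ingest() -> pd.DataFrame:
    # In practice this would read from a file, database, or message queue.
    return pd.DataFrame({"amount": [10.0, None, 250.0, 250.0],
                         "country": ["DE", "US", "US", "US"]})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna().drop_duplicates()

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["amount_log"] = np.log1p(out["amount"])  # simple derived feature
    return out

def store(df: pd.DataFrame, path: str) -> None:
    df.to_csv(path, index=False)

store(engineer_features(clean(ingest())), "features.csv")
```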
A collection of data with little organization, structure, or oversight. This often happens when a Data Lake is poorly designed, managed, and/or documented.
A centralized repository used to store, organize, and curate data. As opposed to data lakes, it usually only contains processed (and validated) data, in relational form, with an enforced schema.
A set of practices, principles, and tools that aims to improve the collaboration between software development (Dev) and IT operations (Ops) teams. DevOps emphasizes automation, continuous integration, and continuous delivery to shorten the software development life cycle, reduce deployment failures, and increase the speed and quality of software releases.
Defines how much error, instability, or unreliability users accept over a rolling period of time. This budget can be used to accommodate events such as planned and unplanned releases or unavoidable hardware failures. It is calculated as error budget = 100% - SLO and should be accompanied by an error budget policy.
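A short worked example of the formula, assuming an (illustrative) 99.9% availability SLO over a 30-day window:

```python
slo = 0.999                            # 99.9% availability target
window_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day window

error_budget_fraction = 1.0 - slo      # error budget = 100% - SLO = 0.1%
error_budget_minutes = window_minutes * error_budget_fraction
print(error_budget_minutes)            # 43.2 minutes of tolerated downtime per window
```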
A contractual agreement between business and developers which specifies what happens when little or no error budget is remaining. For example, it might mean that top priority shifts from feature development to addressing reliability issues.
The field of AI research that focuses on developing methods and techniques to make the decision-making process of AI systems more transparent, interpretable, and understandable to humans. Explainable AI aims to address the "black box" problem of complex models, such as deep learning, by providing insights into the factors influencing the model's predictions, which can help build trust, ensure fairness, and facilitate better decision-making.
A centralized repository used to store engineered features. It helps data engineers and data scientists to share, organize, and access computed feature values used for model training, validation, and inference.
A large, pre-trained language model that serves as a fundamental building block for various natural language processing (NLP) tasks. These models are trained on massive amounts of text data to learn the statistical patterns and semantic relationships of language.
A branch of AI that focuses on creating models and systems capable of generating new content that is similar to or indistinguishable from human-generated content. These models are designed to learn patterns, characteristics, and structures from existing data and use that knowledge to generate new data that has similar properties.
The process of searching for the optimal set of hyperparameters that govern the behavior of a machine learning algorithm. Hyperparameter tuning can significantly impact the performance of a model and involves techniques such as grid search, random search, and Bayesian optimization to find the best combination of hyperparameter values.
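As an illustration, a minimal grid-search sketch with scikit-learn (the parameter grid and toy dataset are arbitrary choices for demonstration):

```python
# Exhaustively evaluate a small hyperparameter grid with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```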
A large language model is an artificial intelligence (AI) system designed to understand and generate human language. It is trained on massive amounts of text data and learns patterns and relationships within the data to generate coherent and contextually relevant responses. Large language models have a wide range of applications, including natural language understanding, machine translation, chatbots, content generation, language-based search, and virtual assistants.
The practices and processes involved in deploying, managing, and monitoring large language models in production environments. These operations aim to ensure the reliability, scalability, and efficiency of the models during real-world usage.
A subset of AI that focuses on the development of algorithms and models that can learn from data and improve their performance over time. Machine learning enables computers to make predictions, recognize patterns, and make decisions without explicit programming, by using statistical techniques and mathematical optimization.
The union of Machine Learning, Data Engineering, and DevOps.
A set of practices, principles, and tools that aims to streamline the development, deployment, and monitoring of machine learning models. MLOps combines aspects of machine learning, data engineering, and DevOps, focusing on reproducibility, automation, and continuous integration and delivery of models to facilitate collaboration between data scientists, data engineers, and IT operations teams.
The phenomenon where the performance of a machine learning model degrades over time due to changes in the underlying data distribution or the relationships between input features and target variables. Model drift can result from concept drift or data drift and often necessitates continuous monitoring, model retraining, and updating to maintain optimal performance.
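One common (though simplified) way to spot the data-drift component is to compare a feature's training-time distribution with its recent production values, for example with a two-sample Kolmogorov-Smirnov test; the synthetic data and the 0.05 threshold below are illustrative:

```python
# Flag a single feature as drifted if its production distribution differs
# significantly from the training-time reference distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=1000)    # feature values seen during training
production = rng.normal(0.5, 1.0, size=1000)   # recently observed feature values

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.05:
    print("Drift detected - consider retraining or investigating the data source")
```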
The process of evaluating a machine learning model's performance using a set of metrics and validation techniques. Model validation typically involves splitting the dataset into training, validation, and testing subsets to assess the model's ability to generalize to new, unseen data. Common validation techniques include cross-validation, holdout validation, and bootstrapping.
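A minimal cross-validation sketch with scikit-learn (the toy dataset and model choice are illustrative):

```python
# Estimate generalization performance with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```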
The ability to understand the internal state of a system by examining its external outputs, such as logs, metrics, and traces. In the context of software systems and MLOps, observability is crucial for monitoring the performance, diagnosing issues, and optimizing the efficiency of deployed applications and machine learning models.
A contract between a service provider and a service consumer. It defines the expectations towards a service and the consequences for unmet expectations.
An objective metric for reliability. It can be understood as a proxy measurement for user happiness. High values should mean that most users are content, and low values should mean that most users are unhappy.
Thresholds for SLIs which define whether the system is performing reliably. They can be understood as the dividing line between happy and unhappy users, and they should be stricter than any SLAs to which they relate.
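For example, a measured availability SLI can be checked against an SLO threshold as follows (the request counts and the 99.9% target are illustrative):

```python
good_requests = 999_132
total_requests = 1_000_000
slo = 0.999                             # 99.9% availability target

sli = good_requests / total_requests    # measured availability SLI
print("SLO met" if sli >= slo else "SLO violated - error budget is being consumed")
```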
The engineering discipline of creating highly scalable and reliable systems. It focuses on objectively managing the tradeoff between the requirements of different teams, namely fast releases (development teams), quality control (QA teams), and system reliability (operations teams). Important components include SLIs, SLOs, and error budgets.