In this article, we will walk through the stages of AI adoption for Azure shops that have the data necessary to start training/serving models.
Surprisingly, statistics show that many companies spend around a year getting from the decision to adopt AI to a live production deployment. Another telling statistic is that 79% of AI projects fail because of infrastructure issues. MLOps is evolving to solve these problems, and best practices are already taking shape.
We want to share our experience from an engineering and management standpoint and contribute it back to the community. Although this article focuses on tools available in Microsoft Azure, we also want to mention our partners: DVC's tools are excellent for experiment tracking and model traceability, and Netris is the only tool that brings VPC-like network management to on-premises platforms.
First things first: Data
Data is the new oil. A few years ago, there was still a debate about whether data or algorithms mattered more. ChatGPT and all the other tools proved that data wins: the more quality data points you have, the better the resulting model. Algorithms and network architectures are still improving; however, the phase of initial fundamental ideas like back-propagation is already behind us. Most papers today are LEGO-like combinations of existing building blocks applied to new applications, and that situation will likely prevail. However, some pioneers are still developing strategies for improved, less data-hungry model architectures.
A famous tweet once claimed, "Data scientists: the ones who need big data will be unemployed in 10 years. Machine learning increasingly shifts from data-intensive to learning-intensive." That prediction turned out to be wrong.
The keyword in the previous paragraph is "quality data." Companies that have high-cardinality, trustworthy data out of the box are lucky, as they can benefit from it immediately. Many of our clients have an AI team but struggle to obtain quality data. Some dive into the promising field of synthetic data, but in our experience it does not replace actual data.
A successful AI initiative starts with quality data, and depending on the nature of your business and its applications, you may be lucky to have it out of the box.
Data infrastructure
Data is part of any infrastructure long before a company starts its AI journey. It is usually represented as CSV files, web server access logs, relational or NoSQL database files, etc. We call that raw data. Raw data is stored in systems used as Data Lakes, hence the name. Azure has a service built on top of Azure Blob Storage called Azure Data Lake Storage Gen 2. It is not an independent service but an extension of Azure Blob Storage: you still use Blob Storage underneath, and the service adds an interface and features such as a hierarchical namespace on top. These capabilities allow Azure Data Lake Storage to integrate with Azure Synapse Analytics or Azure Databricks for data exploration.
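To make the raw-data access concrete, here is a minimal sketch of reading a file from Azure Data Lake Storage Gen 2 with the azure-storage-file-datalake SDK; the storage account, container, and file path names are placeholder assumptions, not values from a real environment.

```python
# A minimal sketch of reading raw data from Azure Data Lake Storage Gen 2.
# Assumes the azure-storage-file-datalake and azure-identity packages;
# the account, file system (container), and path names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# File systems map to Blob Storage containers; directory paths exist thanks to
# the hierarchical namespace that ADLS Gen 2 adds on top of Blob Storage.
file_system = service.get_file_system_client("raw-data")
file_client = file_system.get_file_client("web-logs/2024/01/access.log.csv")

raw_bytes = file_client.download_file().readall()
print(raw_bytes[:200])
```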
The first step is always to define the problem to solve. That is achieved through the cooperation of the infrastructure engineer, data engineer, and data scientist [1]. The infrastructure engineer supports the data scientist with access to the Data Lake and tools for convenient data exploration. While many data scientists are used to exploring data with Python libraries in Jupyter Notebooks, a more robust approach is to use the specialized tools that Azure Synapse Analytics, Azure Databricks, and other third-party vendors bring to your Azure environment. Data exploration results in the definition of the problems that can be solved with machine learning by training models on the existing data. Typically, at that stage, the cooperation between data engineers and data scientists becomes highly efficient, as the data scientist knows what needs to be extracted from the data lake and how it should be transformed and loaded into the data warehouse. Precise task definitions for the data pipelines are decomposed at that stage, data engineers start building those pipelines with Azure Data Factory or similar tools available in Azure, and feature engineering begins.
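As a hedged illustration of that exploration step, the sketch below profiles a raw log table with PySpark, as one might do in an Azure Synapse or Databricks notebook; the ADLS path and the column names are assumptions made for the example.

```python
# A minimal data exploration sketch for a Synapse or Databricks notebook.
# The ADLS path and column names ("status", "latency_ms") are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in managed notebooks

logs = spark.read.option("header", True).csv(
    "abfss://raw-data@<storage-account>.dfs.core.windows.net/web-logs/2024/"
)

# Basic profiling: row count, null counts per column, latency per status code.
print("rows:", logs.count())
logs.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in logs.columns]).show()
logs.groupBy("status").agg(
    F.count("*").alias("requests"),
    F.avg("latency_ms").alias("avg_latency_ms"),
).orderBy("requests", ascending=False).show()
```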
During the last 3–5 years, specialized Data Warehouse extensions have appeared. Feature Stores are not precisely Data Warehouses; they are specialized extensions of Data Warehouses and, sometimes, of Data Lakes. The primary purpose of a Feature Store is to separate the data source specifics from the AI model training code: the data-loading portions of the training code consume only the SDK the Feature Store provides, SQL queries are removed from the code, and AI engineers deal with features instead of various data source types. Data Warehouses are used for reporting and other analytical applications in the company, so AI engineers are not the only consumers of Data Warehouse solutions like Azure Synapse Analytics or Azure SQL Database. Feature Stores, on the other hand, are consumed only by AI engineers, much as Data Marts are extensions of Data Warehouses built for a specific audience.
Feature stores are the last stop for the data before the start of the training process.
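To make the idea concrete, here is a minimal sketch of a training data loader that talks to a feature store SDK instead of issuing SQL. It uses the open-source Feast library purely as an illustration, not an Azure-specific product; the repository path, feature names, and entity dataframe are assumptions.

```python
# A minimal sketch of a training data loader that consumes a feature store SDK
# instead of issuing SQL. Uses the open-source Feast library as an illustration;
# the repo path, feature names, and entity dataframe are placeholders.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repository

# Entities (e.g. customer IDs) and event timestamps for which we want features.
entity_df = pd.DataFrame(
    {
        "customer_id": [1001, 1002, 1003],
        "event_timestamp": pd.to_datetime(["2024-01-10", "2024-01-11", "2024-01-12"]),
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_basket_value",
    ],
).to_df()

print(training_df.head())
```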
Training infrastructure
Data transformations are memory- and compute-heavy operations that can be highly parallelized; therefore, GPUs are sometimes used for data transformations as well. However, the main application of GPUs is training the AI model, which is the most expensive stage in the AI development lifecycle.
However, the training stage does not start immediately; before training begins on the full training dataset, the hyperparameter tuning and experimentation phases are executed. Typically, AI engineers use 10% to 15% of the data to experiment with hyperparameters. That involves multiple executions of the training code on the actual training infrastructure to discover the optimal set of parameters. If the team is relatively big and numerous people are working on the same problems, experiment tracking and sharing become a crucial organizational problem, which is solved with tools like Azure ML Studio. It has built-in compute auto-deployment features; Azure Kubernetes Service is one of the options Azure ML Studio uses to deploy a compute cluster for experiments and actual training. Azure ML Studio is a great tool. However, we prefer DVC Studio for experiment tracking, and DVC itself is the best tool for practicing GitOps in your AI workflows. Some of our clients in finance and healthcare use it because it encapsulates the data used for training, the model instance file, and the hyperparameter values into a single Git commit, providing model traceability. Azure ML Studio offers just a subset of those features.
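As a hedged sketch of that experimentation-on-a-subset idea (not our production setup), the snippet below samples roughly 15% of a dataset and runs a randomized hyperparameter search with scikit-learn; the model, parameter ranges, and metric are illustrative assumptions, and in practice each run would also be logged to DVC Studio or Azure ML Studio.

```python
# A minimal sketch of hyperparameter experimentation on a ~15% subset of the data.
# The model, parameter ranges, and scoring metric are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=20_000, n_features=30, random_state=42)

# Hold back ~85% of the data; experiment on the remaining ~15%.
_, X_subset, _, y_subset = train_test_split(X, y, test_size=0.15, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [2, 3, 4],
    },
    n_iter=10,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_subset, y_subset)

print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 4))
```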
Typically, when a company succeeds in AI adoption, the costs of GPU-enabled instances skyrocket, and the need to move on-premises emerges. Infraheads has an excellent automation framework for bare-metal Kubernetes cluster provisioning to support the needs of our clients in such cases, and Azure provides a wide range of networking services that help build hybrid infrastructures.
Testing and inference infrastructure
Before any production traffic is forwarded to the inference endpoints, the model is tested with a testing dataset, another 10% to 15% subset of the actual data. The testing set is run through inference, and the results are compared with the ground-truth values to assess the model's qualitative performance; AI engineers can add custom metrics based on the business requirements and the application, alongside the widely used metrics for that assessment. After the testing, the model instance is registered in the Azure ML Model Registry, and the ML engineers finalize the inference service. That is, in general, an API server that reads HTTP POST requests with the features in the payload and returns the result the model produces; that process is called inference. Azure ML Studio has excellent features to facilitate it, and for highly loaded systems it provides horizontal autoscaling.
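As a hedged sketch of such an inference service, here is the init/run scoring-script pattern used by Azure ML managed online endpoints; the model file name and the JSON payload layout are assumptions made for the example.

```python
# A minimal Azure ML online-endpoint scoring script sketch (init/run pattern).
# The model file name and the JSON payload layout are illustrative assumptions.
import json
import os

import joblib
import numpy as np


def init():
    # AZUREML_MODEL_DIR points at the registered model's files inside the endpoint.
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)


def run(raw_data: str):
    # Expects a payload like {"data": [[f1, f2, ...], ...]} and returns predictions.
    features = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(features)
    return predictions.tolist()
```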
Monitoring and redeployment
Once deployed, the model's performance starts degrading as real-world data drifts away from the data it was trained on. Therefore, it has to be monitored constantly, and an automated retraining and continuous deployment workflow should be implemented and tested before going live. Azure Monitor and Application Insights cover infrastructure and application metrics, but they provide only an incomplete set of the model-specific monitoring features this requires.
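A hedged sketch of one such model-specific check: comparing the distribution of a live feature against its training distribution with a two-sample Kolmogorov–Smirnov test and flagging drift that could trigger the retraining workflow. The feature arrays and the 0.05 threshold are assumptions made for the example.

```python
# A minimal data-drift check sketch: compare a live feature's distribution with
# the training distribution using a two-sample Kolmogorov-Smirnov test.
# The feature arrays and the 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for training data
live_feature = rng.normal(loc=0.4, scale=1.2, size=1_000)      # stand-in for recent production data

statistic, p_value = ks_2samp(training_feature, live_feature)

if p_value < 0.05:
    # In a real workflow this would raise an alert or kick off the retraining pipeline.
    print(f"Drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```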
Data Mesh and microservices
From the overall architecture standpoint, model services should be treated as microservices. SemVer 2.0.0 should be used for versioning the service source code. (Versioning the models themselves is a separate topic that we intentionally skip here.)
In analogy with the microservices architecture, a new architectural paradigm is evolving for data. Data Mesh suggests moving away from centralized data lakes and warehouses, creating the data infrastructure for each microservice separately, and delegating responsibility for the resulting "data products" to the team that owns the service. We closely monitor the evolution of this concept; however, we have yet to implement this kind of architecture ourselves.