ReasonField Lab
VirtusLab Group Company
hello@reasonfieldlab.com
ReasonField Lab 2024, All Rights Reserved
This blog post covers technical information and tools for performing Machine Learning Operations (MLOps) in practice. First, I will outline MLOps principles and how to apply them, then go over the levels of MLOps maturity in projects, and finish with an example of how Revolut does it.
Now, I am going to explain the pillars of MLOps, which guide you to a robust and mature ML system.
In the early stages of machine learning projects, versioning and reproducibility receive little attention. However, once the project reaches a more advanced stage, it is essential that the same setup and hyperparameters yield reproducible, deterministic results.
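One concrete step toward determinism is pinning every source of randomness at the start of each run. A minimal sketch using only the standard library (frameworks such as NumPy or PyTorch expose their own seeding calls, e.g. np.random.seed or torch.manual_seed, which you would add when those libraries are in use):

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Pin common sources of randomness so reruns are deterministic."""
    random.seed(seed)
    # Makes Python's hash randomization reproducible across interpreter runs.
    os.environ["PYTHONHASHSEED"] = str(seed)

# Two runs with the same seed now yield identical draws.
set_seed(123)
first = [random.random() for _ in range(3)]
set_seed(123)
second = [random.random() for _ in range(3)]
```

The same seed must also be recorded alongside the experiment so a run can be reproduced later from its logged configuration.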
Code for experiments might change dynamically, so it is vital to keep track of these changes via version control tools like Git.
Apart from the software code base, we can version:
Monitoring is often considered the last step in machine learning projects. However, it should be taken into account from the start, even before deploying the model to production. Monitoring should not be treated as a deployment-only concern: you should also track training experiments. During each experiment, the following can be tracked:
Tools such as Weights & Biases, Neptune, or MLflow are worth using to monitor training.
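Whichever tool you pick, the underlying idea is the same: each run records its hyperparameters, metrics over time, and resulting artifacts. A tool-agnostic sketch of that structure (the field and method names here are illustrative, not any particular tool's API):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRun:
    """Minimal record of a single training run."""
    params: dict                                   # hyperparameters, e.g. {"lr": 1e-3}
    metrics: list = field(default_factory=list)    # (step, name, value) tuples
    artifacts: list = field(default_factory=list)  # paths to saved models, plots, etc.

    def log_metric(self, step: int, name: str, value: float) -> None:
        self.metrics.append((step, name, value))

run = ExperimentRun(params={"lr": 1e-3, "batch_size": 32})
run.log_metric(step=1, name="train_loss", value=0.91)
run.log_metric(step=2, name="train_loss", value=0.74)
```

Dedicated trackers add persistence, comparison dashboards, and artifact storage on top of this basic record.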
When considering inference monitoring, we can divide it into the following groups:
For inference, monitoring tools largely depend on your setup. If models are deployed with AWS Lambda, AWS CloudWatch provides extensive logging capabilities. In GCP, the corresponding products are Google Cloud Functions for deployment and Stackdriver for logging and monitoring. If Kubernetes is used as the deployment environment instead, the following should be considered:
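Regardless of platform, the raw signals are similar: per-request latency, throughput, and error counts. A platform-agnostic sketch that records latency around a prediction function (in production these numbers would be exported to CloudWatch, Stackdriver, or a similar backend rather than kept in memory, and the predict function stands in for a real model call):

```python
import time
from functools import wraps

latencies_ms: list[float] = []

def monitored(fn):
    """Record wall-clock latency of each call, in milliseconds."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return wrapper

@monitored
def predict(features):
    # Stand-in for a real model inference call.
    return sum(features) > 1.0

predict([0.4, 0.9])
```

The same wrapper pattern extends naturally to counting errors or tagging requests by model version.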
Testing machine learning systems is still not nearly as formalised as it is with software. What can we actually test? Here is the list:
Adding tests will make your system more robust and resilient to unexpected changes on both the data and infrastructure ends. With data validation, distributions and statistics of training and production data can be compared.
Ordinary software testing practices apply to tests on pipelines and transformations. For the infrastructure, code created using IaC (Infrastructure as Code) can be covered with unit tests. Compliance tests are problem-specific and should be implemented with great care.
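As a concrete example of data validation, a crude drift check can compare a production feature's mean against the training distribution and flag values that fall too many standard deviations away. A minimal sketch (the three-sigma threshold is an arbitrary illustration; real validation tools use richer statistical tests):

```python
from statistics import mean, stdev

def mean_drifted(train: list[float], prod: list[float],
                 sigmas: float = 3.0) -> bool:
    """Flag drift when the production mean falls outside
    train_mean +/- sigmas * train_std."""
    mu, sd = mean(train), stdev(train)
    return abs(mean(prod) - mu) > sigmas * sd

train_feature = [1.0, 1.2, 0.9, 1.1, 1.0]
ok = mean_drifted(train_feature, [1.05, 0.95, 1.1])    # similar distribution
bad = mean_drifted(train_feature, [5.0, 5.2, 4.9])     # clearly shifted
```

A check like this can run as a test in the pipeline, failing the build before a drifted dataset ever reaches training.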
Each of the previous principles is related to this step. With more mature machine learning systems, iteration and retraining of models are happening more often. At a certain point, this frequent retraining becomes possible only if you have every step automated. MLOps teams aim to have an end-to-end deployment of models without the need for human intervention.
As this field is still developing, there is a lot of freedom in choosing what suits your use case best. The following guidelines should be considered a starting point for exploration (State of MLOps).
According to Google, there are three ways of implementing MLOps: manual process, semi-automated, and fully orchestrated.
This level is typical for companies just starting their MLOps journey, with a manual ML workflow and decisions driven by data scientists. Models are trained infrequently. Its characteristics are as follows:
ML systems at this level are prone to frequent failures in a dynamic data environment. Using CI/CD and tracking/monitoring algorithms is good practice to address this. An ML pipeline with monitoring and CI/CD allows for rapid tests, stable builds, and safe introduction of new implementations. This is the last level at which models are trained locally; on the levels above (1 and 2), models are trained in the cloud as a job, for example using Airflow or another orchestration tool.
On level 1, the focus is on continuous training via automated ML pipelines. This setup is suitable for solutions operating in a constantly changing environment.
Its characteristics look as follows:
On level 1 of MLOps, both deployment and training code are published to the cloud.
Additional components that have to be implemented apart from the ones already on level 0:
Level 1 MLOps is great when you want to automate your training. However, it accommodates only changes in data; any modification to the training scheme requires redeploying the whole pipeline.
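The continuous-training loop of level 1 boils down to a trigger policy: retrain when data drift is detected or when the model has simply gone stale. A schematic sketch of such a policy (the seven-day staleness window is an illustrative assumption, not a recommendation):

```python
import datetime as dt

def should_retrain(last_trained: dt.date, today: dt.date,
                   drift_detected: bool, max_age_days: int = 7) -> bool:
    """Trigger the automated pipeline on data drift or model staleness."""
    model_age = (today - last_trained).days
    return drift_detected or model_age >= max_age_days

# A week-old model is retrained even without a drift alert.
stale = should_retrain(dt.date(2024, 1, 1), dt.date(2024, 1, 8),
                       drift_detected=False)
```

In practice, this decision would live in an orchestrator such as Airflow, which then runs the full validate-train-evaluate-deploy pipeline without human intervention.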
This level is typical of highly tech-driven companies, which often experiment with different implementations of pipeline components. They also frequently (daily or even hourly) train new models, update them within minutes, and redeploy them to clusters of servers simultaneously. Without an end-to-end MLOps cycle, this would not be possible.
Characteristics of this level look as follows:
Both data and model analysis remain manual processes, although tooling is provided to support them.
To better show how technology companies apply MLOps, I will present the example of the UK-based financial technology company Revolut. Their main product is an app offering various banking services, handling millions of transactions daily.
They employ a machine learning fraud prevention algorithm called Sherlock to avoid losses due to fraudulent card transactions.
Revolut generally keeps its services on Google Cloud, so all the components are deployed there. For data and feature management, they use Apache Beam via Dataflow. Model development is done in Python with CatBoost. Models are deployed as a Flask app on App Engine. For in-memory storage of customer and user profiles, they use Couchbase.
Production orchestration is handled by Google Cloud Composer (running Apache Airflow).
For monitoring, Revolut uses a combination of two tools. The first is Google Cloud Stackdriver, used for real-time latency analysis, the number of transactions processed per second, and more. The second is Kibana, used for monitoring merchants, the number of alerts and frauds, and model performance, i.e. true positive and false positive rates. Signals from monitoring are forwarded to the fraud detection team via email and SMS.
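The model-performance numbers mentioned above, true positive and false positive rates, are straightforward to compute once labelled outcomes are available. A minimal sketch (not Revolut's actual implementation):

```python
def tpr_fpr(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """True positive rate (share of frauds caught) and false positive
    rate (share of legitimate transactions flagged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)
```

For fraud detection, the trade-off between these two rates is the central tuning decision: a lower decision threshold catches more fraud but blocks more legitimate transactions.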
Model lineage is tracked with Cloud ML Engine. Finally, human feedback is gathered in-app: each user can indicate whether a transaction was fraudulent. As this is a classification task, these labels can be collected directly for the new dataset version.
For interested readers, I can recommend the following resources:
In this blog post, we took a deeper look at MLOps, starting with its principles, followed by a description of MLOps maturity levels. Finally, we explored the example of Revolut's fraud prevention system.