a SoftwareMill Group Company
The Proof of Concept (PoC) is an early project stage in which you test your idea and assumptions. Its purpose is to evaluate the feasibility and potential of a solution or product idea. It’s a short-term, cost-efficient, small-scale project, not a production-ready product. If you’re looking for more information about developing a machine learning Proof of Concept, read:
Do you need a machine learning proof of concept?
In this article, we’re having a look at the stages of machine learning projects, the path towards deployment, and the challenges that you might face along the way.
Machine learning projects are not the same as other software development projects - they are based on data and rely heavily not just on engineering, but also on science. Operationalizing an ML project therefore follows a different path from the one you know from your product development process. Before diving into production readiness, let’s have a look at the phases of machine learning projects:
Let’s now look at what happens right there in the middle, in the development phase - or in our case, since our work is R&D oriented, it’s a Science as a Service process.
Machine learning does not start with modeling, and the path to actually having a model to be deployed is quite lengthy - though that doesn’t have to translate into a long development process as how long a project will take is very use case specific. Before building any solution, you first start with the business perspective. Understanding the needs, requirements, and objectives is the first step of ML development, whether you start with a Proof of Concept or not. You need to have the theoretical foundation set, which is achieved through an analysis of your use case and a deep dive into scientific literature around the solutions to it. Then, you can move on to an essential part of any ML project: data.
In a perfect world, when starting an ML project, you would already have all the necessary data, preferably in the right format and labeled, prepared for model training. However, perfect isn’t real, so before any data-related work can begin, you often find that you need to collect data first.
Please mind that it’s not just any data, though I’m pretty sure you’re aware of that. The data you collect has to be relevant, meaning it’s needed for the use case you’re working on. It has to be reliable: small samples or historical data from the distant past may not do the job. And it has to be of good quality: you know, garbage in = garbage out, that’s just the way it is. So take steps to ensure your data isn’t missing values, has few repetitions, and represents each class well.
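The three quality checks mentioned here (missing values, repetitions, class representation) can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `quality_report`, the `label` field, and the sample records are all made up for the example, and real projects usually rely on dedicated data tooling.

```python
from collections import Counter


def quality_report(records, label_key="label"):
    """Summarize missing values, exact duplicates, and class balance.

    `records` is a list of dicts; `label_key` is a hypothetical field name.
    """
    # Count records that have at least one missing (None) value.
    with_missing = sum(1 for r in records if any(v is None for v in r.values()))

    # Count exact repetitions: records whose full content was already seen.
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    # Count how many examples each class has.
    class_counts = Counter(
        r[label_key] for r in records if r.get(label_key) is not None
    )

    return {
        "rows": len(records),
        "rows_with_missing": with_missing,
        "duplicate_rows": duplicates,
        "class_counts": dict(class_counts),
    }


# Made-up sample data for illustration.
sample = [
    {"age": 34, "income": 52000, "label": "approved"},
    {"age": 34, "income": 52000, "label": "approved"},    # exact repetition
    {"age": None, "income": 18000, "label": "rejected"},  # missing value
    {"age": 51, "income": 75000, "label": "approved"},
]
report = quality_report(sample)
print(report)
```

A report like this won’t fix anything by itself, but it shows early on whether more collection or cleaning work is needed.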
When working on the PoC, you will learn what data you need to solve the issue you’re working on, how to obtain it appropriately, and how to store it. Your ML development partner, whether in-house or external, should give you guidelines and instructions on how to work with your data.
That’s just an umbrella term that covers many tasks related to working with the available data: data preprocessing, data annotation, and data visualization. Data can be incomplete or inaccurate, and it can contain outliers, noise, and errors. So after receiving raw data, data engineers perform data preprocessing, which includes data cleaning, data transformation, and data reduction. Data annotation might also be required: it’s the process of labeling data of different formats such as text, images, or video to allow machines to ‘understand’ them.
That, of course, is not all - but for the sake of this article, we don’t have to go in-depth into the details of what happens with data from when it’s received to when it’s ready to be used with an ML model. The most important thing here is to understand that just having data is still not enough as any data requires quite some work before it’s transformed into valuable insights.
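To give a feel for what data cleaning and transformation can look like, here is a small sketch in plain Python. It assumes a single numeric column with missing entries marked as `None`, uses the common 1.5 × IQR rule for outliers, and rescales the rest to the range 0 to 1. The function name and the numbers are hypothetical, and these are just a few of many possible preprocessing choices.

```python
import statistics


def clean_and_scale(values):
    """Drop missing values, remove outliers (IQR rule), then min-max scale."""
    # Data cleaning: discard missing entries.
    present = [v for v in values if v is not None]

    # Outlier removal with the 1.5 * IQR rule (one option among several).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in present if low <= v <= high]

    # Data transformation: rescale the remaining values to [0, 1].
    smallest, largest = min(kept), max(kept)
    if largest == smallest:
        return kept, [0.0 for _ in kept]
    scaled = [(v - smallest) / (largest - smallest) for v in kept]
    return kept, scaled


# Made-up readings: one missing value and one obvious outlier.
raw = [10.0, 12.0, None, 11.0, 13.0, 12.5, 500.0, 11.5]
kept, scaled = clean_and_scale(raw)
print(kept)
print([round(s, 2) for s in scaled])
```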
Obviously, before you build a model, you have to choose it, but since I’m focusing on the path from a PoC to production, some experimentation and model selection have already happened during the PoC phase. During experimentation, ML engineers explore various architectures to find the best fit for the given needs.
Knowing which model is the most suitable, ML engineers move on to the next step: modeling, or model development. When the model is assembled, it’s trained on the previously prepared data. The available data is divided into training data, which the model learns from, and testing data, which is held back to check the model afterwards.
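The split itself is simple to sketch. The snippet below is a minimal illustration in plain Python, assuming a random 80/20 split with a fixed seed so the result is repeatable; the helper name `train_test_split` and the ratio are choices made for this example, not something prescribed by the process described here.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 prepared examples
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))
```

The important property is that no example ends up in both sets, so the testing data really is unseen by the model.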
In model training, data is passed through the machine learning model to find patterns and make predictions, trying to solve the task it was given. With time, over the course of training, the model gets better at making predictions.
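What “getting better over the course of training” means can be shown with a toy example. The sketch below fits a straight line with gradient descent and records the error after every pass over the data. It is a deliberately tiny stand-in for real model training; the data, learning rate, and function name are assumptions made for illustration.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y ~ w * x + b with gradient descent; return (w, b, loss_history)."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):
        # Prediction errors with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Mean squared error: the quantity training tries to reduce.
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
print(history[0], history[-1])
```

The first recorded loss is large and the last one is close to zero, which is the “model gets better with time” effect in miniature.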
Feature engineering is about formulating features from existing data. In simple words:
Feature engineering is a machine learning technique that leverages data to create new variables that aren’t in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.
as written by Harshil Patel on the Towards Data Science blog
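As a small illustration of creating new variables from existing data, the sketch below derives three features from a raw order record. The record layout and the field names (`placed_at`, `total_price`, `item_count`) are hypothetical, chosen only for this example.

```python
from datetime import datetime


def add_features(order):
    """Return a copy of `order` with derived features added."""
    placed = datetime.fromisoformat(order["placed_at"])
    enriched = dict(order)
    # New variables that are not present in the raw record.
    enriched["day_of_week"] = placed.weekday()  # Monday = 0, Sunday = 6
    enriched["is_weekend"] = placed.weekday() >= 5
    enriched["price_per_item"] = order["total_price"] / order["item_count"]
    return enriched


# Made-up raw record.
order = {"placed_at": "2023-03-04T18:30:00", "total_price": 90.0, "item_count": 3}
features = add_features(order)
print(features)
```

A model may find `is_weekend` or `price_per_item` far easier to learn from than a raw timestamp and two separate numbers.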
After the model’s been trained, you need to check its performance. The model is tested on the testing data, which it has not seen before. This helps verify that it will make good predictions when analyzing new data.
Now, testing in machine learning projects looks a bit different than in software development projects. In conventional testing, you would use unit tests, regression tests, and integration tests, to cover both the smallest testable parts of your software (units) and units plus other components of an application together. Machine learning adds more to the mix.
As stated in Google’s overview of debugging ML models:
Unlike typical programs, poor quality in an ML model does not imply the presence of a bug. Instead, to debug poor performance in a model, you investigate a broader range of causes than you would in traditional programming.
For example, here are a few causes for poor model performance:
- Features lack predictive power.
- Hyperparameters are set to nonoptimal values.
- Data contains errors and anomalies.
- Feature engineering code contains bugs.
Debugging ML models is complicated by the time it takes to run your experiments. Given the longer iteration cycles, and the larger error space, debugging ML models is uniquely challenging.
So every machine learning model needs to go through an evaluation. There are various methods for performing that assessment.
Why a single metric such as accuracy can mislead is well illustrated by a healthcare use case: medical examinations for a new disease. The results are overwhelmingly negative: for every 10000 negatives, there are 3 positives. If the model correctly detects all the negatives but is wrong about 2 out of 3 of the positives, its accuracy is still about 99.98% (10001 correct predictions out of 10003), but this metric does not reflect its performance because the model is bad at detecting positives.
We also need precision and recall. In medical use cases, we're going to pay attention to recall and aim to make it higher, sometimes even at the cost of losing some precision. That's because we want to identify all infections to introduce the appropriate treatment methods. Even if a portion of the identified disease cases turn out to be incorrect, which is an issue that can be fine-tuned in the course of development, we're risking much more when we allow the model to NOT detect the disease and let the patient stay oblivious to the fact that they're infected.
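One common way to trade precision for recall, assuming the model outputs a probability-like score, is to lower the decision threshold. The sketch below uses made-up scores and labels to show the effect; the function name and the two thresholds are illustrative assumptions, not a recommendation for any particular medical setting.

```python
def precision_recall_at(threshold, scores, labels):
    """Compute (precision, recall) when scores >= threshold count as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true labels (1 = infected, 0 = healthy).
scores = [0.95, 0.80, 0.45, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0]

strict = precision_recall_at(0.5, scores, labels)
lenient = precision_recall_at(0.3, scores, labels)
print(strict)   # (1.0, 0.5)
print(lenient)  # (0.8, 1.0)
```

With the lower threshold the model catches every infected patient in this toy data, at the price of one false alarm.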
So what are some of the important metrics to keep an eye on?
In short, accuracy shows the proportion of correct predictions among all predictions made by the model. If it’s high, that’s a reason to be happy, but on imbalanced data it doesn’t tell the whole story.
Knowing how accurate the model is isn’t enough, because an accurate model can still perform poorly. Precision indicates how many of the results predicted as positive were actually positive.
Recall is calculated as the ratio of the number of positives correctly classified as such to the total number of actual positives.
The F1-score combines precision and recall: it’s the harmonic mean of the two. It looks as follows:
F1 = 2 * (precision * recall) / (precision + recall)
The F1-score falls between 0 and 1. If the F1-score is high, this means that both precision and recall are high. A low score means that at least one of the two is low, because the harmonic mean is pulled toward the smaller value. A medium score typically means that one of the metrics is noticeably lower than the other, or that both are moderate.
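To see the four metrics side by side, the sketch below computes them from confusion-matrix counts and applies them to the healthcare example from earlier: 10000 negatives all classified correctly, and 3 positives of which only 1 is detected. The function name is made up for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Healthcare example: 1 true positive, 2 missed positives (false negatives),
# 10000 true negatives, and no false positives.
metrics = classification_metrics(tp=1, fp=0, tn=10000, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy comes out at roughly 0.9998 and precision at 1.0, while recall is only about 0.33 and the F1-score 0.5, which exposes the weakness that accuracy hides.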
For more on model evaluation criteria on the example of instance segmentation, check out the article by Kamil Rzechowski, our Senior Machine Learning Engineer:
Instance segmentation evaluation criteria
Voilà! You’ve evaluated your model and it’s fantastic, so it’s ready to be deployed. Production, here we come!
Though you might think the work’s over, it really is not - a productionized model has to be monitored and maintained. The results need to be observed, and the model may need to be optimized or updated. Since it’s not a one-off thing, but a continuous process of learning and producing results, model degradation can happen over time.
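Monitoring can start very simply. The sketch below tracks accuracy over a sliding window of recent predictions and flags the model when it drops noticeably below the accuracy measured at evaluation time. It assumes that true outcomes eventually become available for comparison; the class name, window size, and tolerance are hypothetical values chosen for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over recent predictions and flag degradation."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.window_size = window_size
        self.tolerance = tolerance
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        # Wait until the window is full before judging the model.
        if len(self.outcomes) < self.window_size:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline_accuracy=0.9, window_size=10)
for _ in range(10):
    monitor.record(prediction=1, actual=1)  # model is doing fine
healthy = monitor.is_degraded()
for _ in range(5):
    monitor.record(prediction=1, actual=0)  # model starts making mistakes
degraded = monitor.is_degraded()
print(healthy, degraded)
```

A flag like this is typically the trigger for a closer look, retraining, or an update of the model.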
A lot of things can happen between a finished proof of concept and a fully productionized model. When you start with workshops that culminate in a proof of concept, you’ve already done a fair share of research and learning: you’ve looked at the possible solutions, tested idea viability, and built a project roadmap, so you should be able to avoid many common issues. However, there are certain challenges that can happen to anyone, and if you know these issues can arise, you’ll be better prepared to handle them. Let’s now have a look at the post-PoC challenges organizations can face.
With data being the focal point of ML-based systems, it’s only natural that any problem with it can become a major obstacle for your project.
First of all, there’s data quantity: when building a PoC, you may not need vast amounts of data to validate the concept. For actual ML development, however, you do need a whole lot of data. While in some cases, you can obtain relevant data from third parties, in many cases, e.g. medical projects or in use cases that have not yet been researched, it’s not an option.
If you don't have enough data, communicate this early on in the process. If your ML team knows how much data they can get, they can help you figure out ways to obtain more of it when necessary. It’s also possible to choose a model type that copes with a small amount of training data and roll it out in spite of the shortage - but that’s not possible in every use case either.
But then… There can be too much data, too. If there are huge amounts of data, model training can be very time-consuming and power-intensive. In the case of e.g. billions of records captured every day, you can’t train the model on the entire dataset, and you need to perform sampling - but carefully, as you want your samples to still be representative of the whole dataset. In such a case, it’s best to automate data sampling and integrate such a module into your data processing pipeline.
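One way to keep a sample representative is stratified sampling: take the same fraction from every group so that the proportions of the full dataset are preserved. The sketch below shows the idea in plain Python under the assumption that the data fits in memory and that a grouping key is known; the `region` field and the function name are made up, and a continuous stream of records would call for a different technique.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly `fraction` of the records from each group given by `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    rng = random.Random(seed)  # seeded for repeatable samples
    sample = []
    for members in groups.values():
        # Keep at least one record per group so no group disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up dataset: 900 records from "EU" and 100 from "US".
records = [{"id": i, "region": "EU" if i % 10 else "US"} for i in range(1000)]
sample = stratified_sample(records, key=lambda r: r["region"], fraction=0.1)
us_count = sum(1 for r in sample if r["region"] == "US")
print(len(sample), us_count)
```

The sample keeps the 9:1 ratio between the two groups, which a purely random draw would only match approximately.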
Last but not least comes data quality, which is quite a complex matter. Only valuable data will bring you good results. Data is high quality when it suits its purpose (the problem being solved with ML) well. Data quality has dimensions that refer to the aspects that are key in assessing data: accuracy, completeness, consistency, validity, uniqueness, and timeliness. To make sure your data is of good quality, you can, for example, perform data profiling to evaluate the state of your existing datasets, and data standardization to bring all data to a common format.
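Data standardization is easiest to picture with dates, which tend to arrive in several formats. The sketch below converts a few assumed input formats to ISO 8601 and returns `None` for anything it can’t parse, so those values can be reviewed rather than silently guessed. The list of formats and the function name are assumptions for illustration.

```python
from datetime import datetime

# Input formats assumed for this example; a real dataset defines its own list.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def standardize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


raw_dates = ["2023-03-04", "04/03/2023", " 04.03.2023 ", "March 4"]
standardized = [standardize_date(d) for d in raw_dates]
print(standardized)
```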
There’s one essential thing for you to remember here, though: assessing the quality of data (as well as all the other data science and ML work) should be done by trained professionals who can do it efficiently and effectively.
One of the obstacles to productionizing ML models is integrating them into the client’s existing systems. When this aspect is overlooked at an early development stage, it can give you a headache later on. You need to take into consideration the limitations of your systems, the resources needed for integration, and the costs.
Sometimes the biggest issues are not technical but related to the human factor. To make machine learning work for your organization, you don’t just need technical know-how, but also a data-driven culture, your team’s willingness to learn new ways of getting things done, stakeholders’ understanding of the value ML brings into the company, and appropriate budget for innovations. All these should be resolved before the PoC happens, but it’s not a given - sometimes it turns out later that the company is in fact ill-prepared for machine learning adoption.
Let’s sum up what you need to remember when you want to scale your ML project from PoC to production. There’s quite a lot of strategic and organizational work to do before any modeling comes into play. You don’t have to be perfectly prepared, though. The team that’s working on your proof of concept or machine learning development should guide you through the entire process, ask relevant questions, and bring your attention to common challenges.
I hope this article brought the process of ML development closer to you and explained what has to happen between the PoC stage and a model in production. If you have any questions about the ML project lifecycle, the PoC, or productionizing your project - drop us a line!