However, as the volume of AI projects expanded, many ML practitioners noticed that an over-focus on models and an under-focus on data quality was hurting algorithm performance. This sparked a shift towards data-centric AI, which puts data quality and proper data management at the heart of AI projects – leading to more accurate real-world results.
A neural network is a series of algorithms that seeks to identify underlying relationships in data sets and does so by mimicking how the human brain operates. In this article, we’ll explore how these networks have developed over time and how moving from a model-centric to a data-centric approach can lead to AI project success.
The State of Neural Networks Today
Today, companies have access to countless open-source deep neural networks and datasets, as well as pre-trained neural networks which have been trained on masses of data by researchers or big tech companies.
These pre-trained algorithms allow tech teams to efficiently leverage AI to solve real-world problems without having to spend the time and resources on building and training the algorithms themselves. For example, they can use AI models trained on industry-specific data relevant to their sector.
Tech teams can also use a technique called transfer learning: a neural network originally trained on certain data sets (e.g. images of dogs) to solve a specific problem (e.g. recognizing dogs) is adapted to a similar problem (e.g. recognizing cats) by retraining only a small part of the network with a small amount of high-quality data.
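The core mechanic can be sketched in a few lines. This is a toy NumPy illustration, not a real implementation: the frozen "backbone" here is a random projection standing in for a genuine pretrained network (e.g. an ImageNet-trained ResNet), and the data is synthetic. Only the small new head is updated during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: in practice these weights would come
# from a large network trained on masses of data, and would stay frozen.
W_frozen = rng.normal(size=(4, 8))

def extract_features(x):
    # Frozen layers: forward pass only, W_frozen is never updated.
    return np.tanh(x @ W_frozen)

# New task-specific head, trained from scratch on a small dataset.
w_head = np.zeros(8)

X = rng.normal(size=(64, 4))        # toy inputs (stand-in for images)
y = (X[:, 0] > 0).astype(float)     # toy labels for the new task

for _ in range(200):                # gradient descent on the head only
    p = 1 / (1 + np.exp(-extract_features(X) @ w_head))
    grad = extract_features(X).T @ (p - y) / len(y)
    w_head -= 0.5 * grad

acc = np.mean((extract_features(X) @ w_head > 0) == (y > 0.5))
```

Because only the 8-parameter head is trained, far less data and compute are needed than training the whole network – which is exactly why transfer learning works well with small, high-quality datasets.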
As a result, businesses can deploy ready-to-use algorithms that apply to a wide range of real-world use cases, including object classification, defect detection, language understanding, churn prediction, sentiment analysis, and many more.
As the use cases for these algorithms increased, however, it became more difficult to enhance the performance of the neural network by focusing on the models alone.
What Model-centric AI Gets Wrong
One mistake made by the AI community was simply “throwing” a lot of data at the neural networks without really understanding or analyzing it. We knew that the neural networks would be able to automatically find the necessary features in the data to achieve the defined goal, so we didn’t deem it necessary to extract those features ourselves.
However, this meant that we were sacrificing data quality. Data can be biased, imbalanced, incomplete, inconsistent, invalid, irrelevant, duplicated, or just plain wrong. Data can also be misleading. For example, a neural network was trained to recognize whether an image shows a husky or a wolf. It worked well until it was applied to the real world, where it failed often, misclassifying huskies as wolves even though the differences were clearly visible. It turned out that all of the wolf images used for training had snow in the background, while the husky images did not. The neural network had not learned to distinguish between wolves and huskies, but between images with and without snow.
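A data-centric safeguard against this kind of shortcut learning is to check, before training, whether background attributes correlate with the labels. A minimal sketch in Python – the "snow" flag and the tiny label set are hypothetical, and in practice such attributes would come from metadata or manual annotation:

```python
import numpy as np

# Toy training set: 1 = wolf, 0 = husky.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Hypothetical background attribute: was snow visible in the image?
snow   = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # every wolf image has snow

# A (near-)perfect correlation between a background attribute and the
# label is a red flag: the model can learn the shortcut instead of the task.
corr = np.corrcoef(labels, snow)[0, 1]
```

Here the correlation is exactly 1.0, so a model trained on this data has no incentive to look at the animal at all – the fix is to rebalance the data (e.g. collect wolf images without snow), not to tweak the model.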
Clearly, there was a need to improve the quality of the data being fed to the algorithms to improve their performance. In fact, according to McKinsey’s State of AI 2021 report, organizations seeing the highest financial returns from AI consistently train and test their data more than others.
And so, the movement around data-centricity emerged. At intive, we saw this approach being adopted early on in AI projects across industries. However, following a data-centric path doesn’t mean that ML practitioners no longer need algorithmic expertise. In fact, it’s important to have both. Only if you have a deep, holistic understanding of both parts – the data and the ML algorithms – can you solve hard real-world tasks successfully. You need to be able to evaluate existing state-of-the-art solutions, adjust them, tweak them, and understand why they might or might not work for your specific problem.
How Does Data-centricity Work?
At intive, we care about our clients’ data and strive to add real value to our clients’ products. Here’s how the process looks in action:
At the beginning of most projects, we take a deep look at the data in Data Discovery Sprints. During these one or two sprints, we work to understand the data, find the value in it, and uncover its flaws. Studies have shown that poor data quality is one of the primary reasons why AI projects fail, so this step is paramount.
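Uncovering flaws typically starts with a simple profiling pass. As a minimal sketch with pandas – the dataset, column names, and values are hypothetical, and real discovery work goes far beyond these few statistics:

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset as it might arrive at the start of a project.
df = pd.DataFrame({
    "age":   [34, 29, np.nan, 41, 29, 120],   # a missing value and an implausible age
    "churn": ["yes", "no", "no", "yes", "no", "no"],
})

# A first-pass data profile: row counts, duplicates, gaps, class balance.
report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_per_column": df.isna().sum().to_dict(),
    "class_balance": df["churn"].value_counts(normalize=True).to_dict(),
}
```

Even this tiny profile already surfaces a duplicate row, a missing value, an out-of-range age, and a class imbalance – exactly the kinds of issues that drive the estimate in the next step.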
We then estimate how much work needs to be done to improve the data quality. We assess whether we need to transform the data, clean it, collect more data, correct the labels, handle outliers and missing values, or create additional data with data augmentation techniques.
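A few of these steps can be sketched with pandas. This is a toy illustration under assumed data: the column names, the percentile clipping thresholds, and the jitter-based augmentation are illustrative choices, not a prescription:

```python
import numpy as np
import pandas as pd

# Toy dataset exhibiting some of the flaws above (hypothetical sensor data).
df = pd.DataFrame({
    "sensor": [1.0, 1.1, 1.2, 950.0, np.nan, 1.1],   # outlier + missing value
    "label":  ["ok", "ok", "defect", "ok", "ok", "ok"],
})

df = df.drop_duplicates()                                  # remove duplicate rows
df["sensor"] = df["sensor"].fillna(df["sensor"].median())  # impute missing values

# Clip extreme outliers to the 1st-99th percentile range (one possible policy).
lo, hi = df["sensor"].quantile([0.01, 0.99])
df["sensor"] = df["sensor"].clip(lo, hi)

# Simple augmentation: jittered copies of the rare "defect" class.
minority = df[df["label"] == "defect"]
noise = np.random.default_rng(0).normal(0, 0.01, len(minority))
df = pd.concat([df, minority.assign(sensor=minority["sensor"] + noise)],
               ignore_index=True)
```

For images, augmentation would instead mean transformations such as rotations, crops, or brightness changes; the principle – creating additional valid training examples from the data you have – is the same.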
It’s crucial to set up the data and ML pipelines by applying MLOps early in the project. This guarantees a smooth workflow, traceability of data and ML models, clear and clean processes, and deployment readiness.
After this phase, we build a proof of concept and work on the data while developing the PoC. When our solutions are deployed in the real world, it is extremely important to be able to monitor the real-world data, process it, and train the algorithms with the incoming data, as real-world data can significantly differ from training data.
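One common way to monitor whether real-world data has drifted away from the training data is a simple distribution-comparison statistic. A sketch using the Population Stability Index (PSI) on synthetic data – the data is made up, and the 0.2 threshold is a widely used rule of thumb rather than a universal constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature values seen at training time vs. in production.
train = rng.normal(loc=0.0, scale=1.0, size=5000)
live  = rng.normal(loc=0.8, scale=1.0, size=5000)  # the distribution has shifted

def drift_score(reference, incoming, bins=10):
    """Population Stability Index: compares binned distributions."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    inc_frac = np.histogram(incoming, edges)[0] / len(incoming)
    ref_frac = np.clip(ref_frac, 1e-6, None)       # avoid log(0)
    inc_frac = np.clip(inc_frac, 1e-6, None)
    return float(np.sum((inc_frac - ref_frac) * np.log(inc_frac / ref_frac)))

psi = drift_score(train, live)
# Rule of thumb: PSI > 0.2 signals significant drift worth investigating,
# e.g. by retraining on the incoming data.
```

In a deployed system this kind of check would run continuously per feature, with alerts feeding back into the MLOps pipeline set up earlier.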
This process is by no means rigid and can look different for each project. Luckily, our clients benefit from our teams’ years of experience in building and “productionizing” AI-powered products for a range of industries, including automotive, healthcare, media, eCommerce, retail, fintech, and technology.
Applying a data-centric approach to our clients’ AI projects has allowed us to help them produce reliable real-world results and transform their businesses. Interested in seeing how we might be able to support your AI needs? Get in touch today.