Training Data Amount

Data and Machine Learning: The Importance of High-Quality Training Data

We have all heard the old adage that practice makes perfect, but the reality is that perfect practice makes perfect. This refined view of quality practice matters a great deal in machine learning. Even the most robust machine learning models can be rendered dysfunctional if they are built on inadequate or insufficient training data.

Data is Messy


The format for training data varies widely and demonstrates the broad spectrum of potential uses for machine learning. This data can include images, text, clicks, audio, video, etc. A model for a self-driving car might utilize images and videos that are labeled to distinguish between the road, cars, and people. Alternatively, a customer service chatbot might parse through text representing the many ways that someone could ask for a refund or shipping details.

That data is then labeled (tagged, annotated, etc.; pick your word) to pinpoint the specific elements that form a pattern for the algorithm. For text, this involves identifying the “intent” of the question or statement and the entities within the phrase (we’ll get to those shortly).

By identifying similar patterns, the algorithm can then label previously unseen data on its own. This is where the machine is learning, and it is precisely why the data that forms these patterns must be high quality or “clean.” Feeding the machine poor or “dirty” data creates problems that compound down the line as the algorithm is put into use and picks up the wrong patterns.
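As a rough illustration of what one labeled example could look like for the customer service chatbot mentioned above, the sketch below stores a single utterance together with its intent and its entity spans. The class names, the intent name `request_refund`, and the entity types are all hypothetical, chosen for illustration rather than taken from any particular labeling tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int   # index of the first character of the entity
    end: int     # index one past the last character of the entity
    label: str   # hypothetical entity type, e.g. "order_id"


@dataclass(frozen=True)
class LabeledUtterance:
    text: str
    intent: str        # the outcome a human labeler assigned
    entities: tuple    # tuple of EntitySpan objects

    def entity_text(self, span):
        """Return the substring of the utterance that a span covers."""
        return self.text[span.start:span.end]


text = "I want a refund for order 12345"
refund_start = text.index("refund")
example = LabeledUtterance(
    text=text,
    intent="request_refund",
    entities=(
        EntitySpan(refund_start, refund_start + len("refund"), "request_type"),
        EntitySpan(text.index("12345"), len(text), "order_id"),
    ),
)

# Show which part of the sentence each tag points at.
for span in example.entities:
    print(span.label, "->", example.entity_text(span))
```

Storing character offsets instead of copied strings keeps each tag tied to an exact position in the original text, which makes labeling mistakes easier to spot.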

Imagine you are creating a new music streaming platform that is focused on providing people with recommendations for new music and new playlists. The goal is to provide people with the right music at the right time. The data interpreted here could include not only genres and artists but could also include the time of day, day of the week, and even the location of listening.

Take a sample user who listens to a variety of genres. Without clean, properly labeled training data, the algorithm might erroneously recommend heavy metal before bed and classical piano during workouts because it has homed in on the length of songs. While people have varied tastes in music, it is safe to assume that the algorithm is not identifying the relevant patterns in this user’s data.

With appropriately labeled training data, the result could be very different. Monday mornings might bring heavy electronic music once the algorithm identifies a listener who likes to jump-start their week. Wednesday nights could bring soft jazz as the listener displays patterns that suggest a recurring date night. To cap off the workweek, upbeat classic rock on Friday nights at a neighbor’s house reflects the user handling the aux cord with friends.

The raw inputs to the algorithm are no different between these two scenarios. Instead, it is how the data is labeled and used for training that decides the effectiveness of the program. Effective training, not just training itself, is what makes or breaks a machine learning model.

How is quality data labeled?

Tagged data is annotated to convey to the model the outcome that it is designed to predict. Data labeling can be done in different ways, but it always requires a level of human input. How people are involved varies with the nature of the machine learning model as well as the type of problem it is meant to solve.

Two of the most common ways in which a machine learning model learns are supervised learning and unsupervised learning. In supervised learning, people label the data with the correct outcomes and decide which features matter, guiding the model toward the right patterns. In unsupervised learning, the algorithm works from unlabeled data to detect patterns such as clusters or associations.
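To make the contrast concrete, here is a minimal sketch that uses only the Python standard library. The tempo values and the "calm" and "energetic" labels are invented for illustration. The supervised part learns one average tempo per human-assigned label (a nearest-centroid classifier), while the unsupervised part splits the same numbers into two groups without seeing any labels (a one-dimensional k-means with two clusters). Real systems would normally rely on a machine learning library and far more data.

```python
from statistics import mean

# Hypothetical toy data: song tempo in beats per minute, tagged by a person.
labeled = [(62, "calm"), (70, "calm"), (75, "calm"),
           (128, "energetic"), (140, "energetic"), (150, "energetic")]


def fit_centroids(examples):
    """Supervised: use the human-provided labels to compute one mean per class."""
    by_label = {}
    for tempo, label in examples:
        by_label.setdefault(label, []).append(tempo)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, tempo):
    """Assign the label whose centroid is closest to the new tempo."""
    return min(centroids, key=lambda label: abs(centroids[label] - tempo))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2)."""
    low, high = min(values), max(values)  # simple initial guesses for the centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return sorted(near_low), sorted(near_high)


centroids = fit_centroids(labeled)
print(predict(centroids, 68))      # label learned from the human tags

unlabeled = [tempo for tempo, _ in labeled]
print(two_means(unlabeled))        # groups found without any tags
```

Note that the unsupervised version finds the two groups but cannot name them; attaching a meaning such as "calm" to a cluster still takes a person.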

As it finds its feet, a machine learning algorithm is essentially guessing and checking while it attempts to establish and recognize patterns. With supervised learning, that checking is done by a human to ensure that the machine is moving from scattered guesses to confident and accurate decisions. Over time, through high-quality training data, the machine learning algorithm can begin to operate with more independence, as it is trained to recognize more and more patterns successfully. In short, it is becoming smarter over time.

Intents and Entities

Before labeling any of the data, it is crucial to create a list of the intents (outcomes) that the algorithm is designed to predict. With a self-driving car, those outcomes could range from slowing down for traffic to accelerating at a green light. For an eCommerce site, the desired actions might be offering complementary items at the point of purchase or sending follow-up emails to shoppers who abandoned their carts. Whatever the machine learning application, it is critical that, before you set sail, you have a clearly mapped out final destination to steer toward.

In addition to labeling the outcomes, it is also important to target specific entities or identifiers which help signal the intent to the algorithm. For an eCommerce site, that could be things such as clothing type or time spent on a page. On the other hand, with a self-driving car, it would be helpful to identify features such as speed or location as they will have a significant impact on the outcome.

What is important, and practically essential with intents and entities, is to start with the former. Figure out where you are going first, and then decide what identifiers will help guide you along the way.

Starting the training process

With the potential outcomes and entities integrated into the system, a human can now begin labeling the data. It is important that whoever labels this data understands what the algorithm is designed for and has a high level of attention to detail. Especially in the initial dataset, it is crucial that the person only labels data when they have a high degree of confidence in both the identifiers and the desired final outcome. If the human labeling the data is not confident about how it should be labeled, the machine cannot in turn be expected to effectively recognize patterns.

While there is no definitive answer to how much training data is required to get a machine learning model up and running effectively, more data generally helps. The quantity of data needed will largely depend on the complexity of the problem the model is designed to solve. The best way to tell if a model has sufficient data is to test it on examples it has not seen and see how it performs. If it is confidently identifying certain patterns but missing out on others, there may be a need for more of specific types of training data before putting the model out in the world.
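One simple way to run such a test is to measure, for each category, how many held-out examples the model gets right. The sketch below computes per-class recall from two lists; the genre labels, the predictions, and the 0.7 cutoff are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Fraction of the examples of each true class that the model got right."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical held-out labels and the model's predictions for them.
y_true = ["jazz", "jazz", "jazz", "metal", "metal", "metal", "piano", "piano"]
y_pred = ["jazz", "jazz", "jazz", "metal", "metal", "jazz", "metal", "jazz"]

recall = per_class_recall(y_true, y_pred)

# 0.7 is an arbitrary cutoff for this sketch, not a recommended value.
needs_more_data = [label for label, score in recall.items() if score < 0.7]

print(recall)
print(needs_more_data)
```

A low score for one class is a hint, not proof, that more examples of that class are needed; noisy or inconsistent labels can produce the same symptom.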

[Figure: Training Data Amount. Source: user39663, "How to get the data set size required for neural network training?", URL (version: 2016-09-05)]

It is critical to remember here that while the quantity of training data is very important, it should never come at the expense of quality.

Let’s go back to our music platform. Before going in and labeling the data, we need to decide what the outcomes are. With this specific model, there might be a few different intents: recommend new songs similar to what the listener currently plays, provide music that is outside the listener’s current scope but still related, or compile songs that the user already listens to around a certain mood.

To get to that outcome, there are various types of identifiers that can help guide the machine. While it is apparent that the genre and artist should be labeled, many other entities could be helpful for the system: the popularity of artists, time of day, location, tempo, release date, etc.
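One possible way to pin these decisions down before labeling starts is to write the intents and entities into a small schema. The sketch below is only an illustration: the class names, field names, and validation rules are assumptions, and a real platform would likely track more entities than the handful shown here.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    # The three outcomes described above, decided before any labeling starts.
    SIMILAR_NEW_SONGS = "recommend new songs similar to current listening"
    ADJACENT_DISCOVERY = "suggest related music outside the current scope"
    MOOD_PLAYLIST = "compile familiar songs around a mood"


@dataclass
class ListeningEvent:
    # Entities (identifiers) attached to one listening event.
    genre: str
    artist: str
    hour_of_day: int   # 0-23
    location: str
    tempo_bpm: int
    intent: Intent     # the outcome a human labeler assigned

    def __post_init__(self):
        # Reject obviously invalid labels early to keep the data set clean.
        if not 0 <= self.hour_of_day <= 23:
            raise ValueError(f"hour_of_day out of range: {self.hour_of_day}")
        if not isinstance(self.intent, Intent):
            raise ValueError(f"unknown intent: {self.intent!r}")


event = ListeningEvent("jazz", "Example Artist", 20, "home", 90, Intent.MOOD_PLAYLIST)
print(event.intent.name)
```

Restricting the intent to a fixed set of values means a labeler cannot invent a new outcome by accident, which is one small guard against "dirty" data.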

In order for a machine learning model to prove successful, it is important that there are specifically identified outcomes that the model is supposed to predict. To make things easier for the machine, it is vital to also provide it with key identifiers that help guide its identification of patterns. And, as it goes with any machine, it is important to continue working on it to ensure that it is able to maintain peak performance over an extended period of time.

The Importance of Continuous Training

When introducing a new team member, it is essential that they go through sufficient training in order to properly understand the job and get integrated into the workflow. But, with the best of employees, the learning doesn’t stop there. Every year, every month, every week, and every day presents a new opportunity to refine their skills and perform better at their job.

Machine learning models are not so different from us humans in this respect. That is why high-quality training data is important to get the machine off the ground smoothly from the start. It is also why it is crucial to continuously train the model to help it get more and more effective at the job it was designed to handle. A well-built feedback loop is needed to handle edge cases that result from new inputs or from the ever-changing environment in which the model is deployed.

[Figure: Supervised Feedback Loop. Source: “Designing Effective Supervised Machine Learning Systems”, Joe Morrison]
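The feedback loop described above can be sketched in a few lines. In this hypothetical version, predictions that the model is unsure about are sent to a person for review, and the corrected labels are added to the training set for the next retraining run. The 0.8 confidence threshold, the function names, and the song data are assumptions made for the sake of the example.

```python
def route_predictions(predictions, threshold=0.8):
    """Split model outputs into auto-accepted ones and ones needing human review.

    predictions: list of (item, predicted_label, confidence) tuples.
    threshold: hypothetical confidence cutoff; tune it for the application.
    """
    accepted, review_queue = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item, label))
        else:
            review_queue.append(item)
    return accepted, review_queue


def apply_human_labels(training_set, review_queue, human_labels):
    """Add human-reviewed examples to the training set for the next retraining run."""
    reviewed = [(item, human_labels[item]) for item in review_queue if item in human_labels]
    return training_set + reviewed


predictions = [
    ("song_a", "soft jazz", 0.95),
    ("song_b", "classic rock", 0.55),     # low confidence: send to a person
    ("song_c", "classical piano", 0.40),  # low confidence: send to a person
]

accepted, review_queue = route_predictions(predictions)
training_set = apply_human_labels(
    [], review_queue, {"song_b": "lullaby", "song_c": "classical piano"}
)

print(accepted)
print(training_set)
```

Sending only the uncertain cases to a person keeps the human effort focused on the edge cases where it adds the most value.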

One of the advantages of continuous training is that it is a more efficient way of gathering and labeling data. The initial data set must all be fed into the system from some previous location, and depending on how the system is set up, this can be very time-consuming. When the model is running, data flows directly into it, so the data is already where it needs to be and ready to be labeled and used for training.

Another benefit of continuously training a machine learning model is the relevance of the data that it is taking in. Depending on how a specific model is being used, it might be taking in data from consumers. That data is naturally prone to change, and reacting to these changes effectively is vital. If a model has been trained only on old data reflecting old preferences, new consumer data may be interpreted incorrectly and the machine may choose the wrong outcomes.
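A lightweight way to notice that incoming data no longer resembles the training data is to compare how often each category appears in both. The sketch below uses total variation distance between two label distributions; the genre lists and the 0.3 alert threshold are hypothetical, and other drift measures would work as well.

```python
from collections import Counter


def distribution(labels):
    """Convert a list of labels into relative frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(old_labels, new_labels):
    """Half the summed absolute difference between two label distributions.

    0 means the distributions are identical; 1 means they share no labels.
    """
    old, new = distribution(old_labels), distribution(new_labels)
    all_labels = set(old) | set(new)
    return 0.5 * sum(abs(old.get(k, 0.0) - new.get(k, 0.0)) for k in all_labels)


# Hypothetical listening histories: what the model was trained on vs. recent plays.
training_genres = ["jazz"] * 4 + ["rock"] * 4 + ["electronic"] * 2
recent_genres = ["piano"] * 8 + ["jazz"] * 2

drift = total_variation_distance(training_genres, recent_genres)
print(round(drift, 2))

if drift > 0.3:  # hypothetical alert threshold
    print("Listening habits have shifted; consider relabeling and retraining.")
```

A check like this only flags that something changed. Deciding what the change means, and how the new data should be labeled, remains a job for a person.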

Finally, just as updates are made to phones and computers, so too will updates be applied to machine learning models. When these updates are implemented, it is possible that different outcomes are desired or different types of data will be coming in. When this occurs, the model must be trained again to allow for a seamless transition.

The Future is Human + Machine

One last time, we will return to the music platform. It has been running seamlessly for some time, dishing out the right music at the right time. But suddenly, the algorithm starts tripping up, and, without the help of a human, it cannot self-correct. As it turns out, the sample user has just had a child and altered their listening habits.

Gone are the date nights with soft jazz and Friday nights listening to classic rock with the neighbors. The user’s life now revolves around their tiny human. Their listener profile now shows a radical shift to classical piano as the listener attempts to soothe their child and boost the child’s IQ. Whether the Mozart effect will work is out of the platform’s hands, but accurate song selection is not.

While it is possible that the algorithm might be able to respond effectively to this change on its own, it also may not. It might require new training by the humans in charge to allow the machine to adapt to this previously unencountered pattern shift. As is the beauty of machine learning models, once the model has been properly trained on this new set of data, it will be able to respond effectively to similar changes by aspirational parents in the future.

Quality training data is at the core of effective machine learning models. Without it, these incredible systems will be crippled. Only through the development of a symbiotic partnership between humans and machines can we effectively step into the future of work.
