Data Science

Natural Language: A Guide to Labeling and Training NLP Datasets

August 31, 2022
8 minutes

As technology evolves, the ways people interact with it also change. Internet searches have become dramatically easier. Not long ago, what you typed into a search bar had to be very specific and would often yield strange, unrelated results; as recently as 2016, the Times was writing about advanced “tips” for using the search bar. Today, with more advanced predictive text features, search engines (or anywhere you can type text or leverage natural language) seem to almost read your mind and know exactly what you are looking for.

Data science and parenting are both human jobs

Language is a vital part of human connection. Although all species have their ways of communicating, humans are the only ones that have mastered cognitive language communication. Language allows us to share our ideas, thoughts, and feelings with others. Its power has built the world around us. It’s no surprise then that one of the most aspirational use cases for Artificial Intelligence is to process, understand, and generate language.

Language is what makes us human. It is how we communicate.

Learning a language means mastering a complex system of words, structure, and grammar to communicate effectively with others.

To most people, learning a language comes naturally as we develop from a young age, but the reality is that languages are layers of complex rules that often break depending on context. Even if all of the rules held all of the time, a great deal of brain power would still be required to comprehend vague articles, illogical clauses, irony, and figures of speech. Humans are remarkably good at interpreting pieces of language that are illogical or don’t follow the rules: we use past experience, context, and facial expressions to fill in the holes and interpret the speaker’s intent.

The complexities of understanding language that come naturally to humans are extremely difficult to teach a machine. How are we supposed to communicate this nest of tangled rules, structures, contexts, and logic to a computer?

Natural Language Processing (NLP) is a form of artificial intelligence that helps technology interpret human language.

How do we teach machines to learn language? It’s not too dissimilar from how we teach and learn languages ourselves. If you’ve ever had the pleasure of observing a toddler, welcome to the wonderful world of machine learning, aka parenting a machine. Most parents spend many hours watching their baby discover the world and develop behavioral patterns, much as any data scientist has done in front of their train/test results.

[Image: Teaching an NLP model. Source: “Letter A Tracing Worksheet”; Coloring Rocks]

You won’t find any parent who will tell you that parenting is straightforward and easy (and if someone does, they’re lying). You have to constantly question yourself and what you are teaching this toddler, and adjust to their constantly evolving neural network.

Data scientists hold, to some extent, the same parenting responsibility. If you are recruiting a data scientist (or want to become one) and expect the job to be only about programming, that’s like expecting a parent to raise a child like a dog, hearing only “sit” and “roll over” for their whole childhood, and hoping the child becomes a stable adult.

Algorithms learn by being trained on data. Good thing you’ve been storing those customer transcripts for the past decade, right? Well, chances are the data you’ve stored isn’t quite ready to be understood by machine learning algorithms. The data you want to use usually needs to be enriched or labeled first.


Why is training data important?

Training data is data used to teach a new application, model, or system to recognize patterns relevant to a project’s requirements. Training data for AI or ML is slightly different: it is labeled or annotated with certain techniques to make it recognizable to computers.

This training data set helps machine learning algorithms find relationships, develop understanding, make decisions, and evaluate their confidence when making predictions. And the better the training data is, the better the model performs.

In fact, the quality and quantity of your training data have more to do with the success of your data project than the magical machine learning algorithms themselves. This is especially true for language understanding projects.


How much training data is enough?

There’s really no hard-and-fast rule for how much data you need; different use cases require different amounts. Use cases where the model must be extremely confident (like self-driving cars) require vast amounts of data, whereas a fairly narrow, text-based sentiment model needs far less.

How do you label natural language data?

There is no magic wand to wave in order to take your language data sets and transform them into training data sets that machine learning algorithms can use to begin making predictions.

Today, humans are needed in the data annotation/labeling process in order to identify and classify information. Without these labels, a machine learning algorithm will have a difficult time predicting attributes that enable understanding of spoken or written language. When it comes to annotation, machines can’t function without humans-in-the-loop.
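
To make that concrete, here is what a single labeled record might look like once a human has annotated it. This is only an illustrative sketch in Python; the field names and schema are hypothetical, and every labeling tool has its own format.

```python
# A hypothetical labeled training record; the schema is illustrative,
# not taken from any particular labeling tool.
labeled_example = {
    "utterance": "How is the weather in Delhi?",
    "intent": "get_weather",  # the action the user wants to execute
    "entities": [
        # character offsets into the utterance (end index exclusive)
        {"text": "Delhi", "label": "Location", "start": 22, "end": 27},
    ],
}
```

The pieces of this record (utterances, intents, and entities) are exactly the annotation types covered below.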

The process of labeling any kind of data is complex. It is possible to manage the entire process in Excel spreadsheets, but this quickly becomes overwhelming given everything that needs to be in place:

  • Quality assurance for data labeling
  • Process iteration, such as changes in data feature selection, task progression, or QA
  • Management of data labelers
  • Training of new team members
  • Project planning, process operationalization, and measurement of success

Types of annotations in a natural language data set

1. Utterances

Language data sets consist of rows of utterances. Anything that a user says is an utterance. In spoken language analysis, an utterance is the smallest unit of speech. It is a continuous piece of speech beginning and ending with a clear pause.

For example:

“Can I have a pizza?”

“How is the weather in Delhi?”

These sentences are called utterances.

In some language projects, you will need to first complete an utterance parsing exercise. For instance, let’s say we’re building a machine learning model for understanding customer reviews. Most reviews are more than a single sentence and cover more than one idea the consumer is trying to convey. The first step would be to split each review into separate utterances.
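
As a rough sketch, that first parsing pass can be automated with an off-the-shelf sentence tokenizer such as NLTK’s (this assumes the `nltk` package is installed; the review text is made up):

```python
# A minimal utterance-parsing sketch using NLTK's sentence tokenizer.
import nltk

nltk.download("punkt", quiet=True)  # fetch the tokenizer model once
from nltk.tokenize import sent_tokenize

review = ("Great location and friendly staff. The unit was clean and dry. "
          "I just wish the office hours were longer.")
utterances = sent_tokenize(review)
print(utterances)
# ['Great location and friendly staff.', 'The unit was clean and dry.',
#  'I just wish the office hours were longer.']
```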

2. Intent

These are the actions that users want to execute. Intents, in simple terms, are the machine’s predictions of a user’s intention, drawn from their utterances.

[Image: NLP intent examples]

The above example shows an intent that maps directly to the business of the bot, i.e., looking for a storage unit. Hence it falls under the category of business intents.

There is another category called casual intents. These are mostly the openers and closers of a conversation, like “hi”, “bye”, and “thanks, goodnight”.

Casual intents can also be affirmative or negative, so a simple “yes” or “no” can be an intent too, along with variants like “no, thanks” and “yeah, sure”.
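
To illustrate how a model might learn the mapping from utterances to intents, here is a toy classifier built with scikit-learn. The utterances, intent names, and labels are all made up, and this assumes scikit-learn is installed; a real project would need far more labeled data.

```python
# A toy intent classifier: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hand-labeled training set (illustrative only).
utterances = [
    "I'm looking for a storage unit",
    "Do you have any 10x10 units available?",
    "hi", "hello there",          # casual openers
    "no, thanks", "yeah, sure",   # casual negative / affirmative
]
intents = [
    "find_storage_unit", "find_storage_unit",  # business intents
    "greeting", "greeting",                    # casual intents
    "negative", "affirmative",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(utterances, intents)
print(model.predict(["I need somewhere to store my furniture"]))
```

On a data set this small the prediction is little better than a guess, but the shape of the workflow (labeled utterances in, intent predictions out) stays the same at any scale.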

3. Entity

When we read text, we naturally recognize objects like people, values, locations, and so on. For example, in the sentence “Steve Jobs is one of the founders of Apple, a company from the United States” we can identify three types of entities:

  • “Person”: Steve Jobs
  • “Company”: Apple
  • “Location”: United States

Entity annotation is one of the most important processes in generating training data sets for chatbots, voice assistants, and other NLP applications. It is the act of locating, extracting, and tagging these distinct and independent objects in text.

Entities in a language model are predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.
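
In practice, a pretrained pipeline like spaCy’s can pre-label many of these entities automatically, with human annotators then correcting and extending its output. A minimal sketch, assuming spaCy and its `en_core_web_sm` model are installed:

```python
# Entity pre-labeling with spaCy's pretrained English pipeline.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs is one of the founders of Apple, "
          "a company from the United States")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expect something like: Steve Jobs PERSON / Apple ORG / United States GPE
```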


Who does the labeling?

According to research from Cognilytica, companies spend five times as much on internal data labeling as they do with third parties. Not only is this costly, it is also time-intensive, taking valuable time away from your team when they could be focusing their talents elsewhere. What’s more, building the necessary annotation tools, data pipelines, and workflows often requires more work than the ML project itself.

Organizations use a combination of software, processes, and people to clean, structure, or label data. In general, you have four options for your data labeling workforce:

  • Employees – They are on your payroll, either full-time or part-time. Their job description may not include data labeling.
  • Managed teams – You use vetted, trained, and actively managed data labelers (e.g., CloudFactory or swivl).
  • Contractors – They are temporary or freelance workers.
  • Crowdsourcing – You use a third-party platform to access large numbers of workers at once.

How do you measure data quality?

Machine learning is an iterative process. Data labeling evolves as you test and validate your models and learn from their outcomes, so you’ll need to prepare new datasets and enrich existing datasets to improve your algorithm’s results.

Your data labeling team should have the flexibility to incorporate changes that adjust to your end users’ needs, changes in your product, or the addition of new products. A flexible data labeling team can react to changes in the business environment, data volume, task complexity, and task duration. The more adaptive your labeling team is, the more machine learning projects you can work through.

While the terms are often used interchangeably, we’ve learned that accuracy and quality are two different things.

  1. Accuracy in data labeling measures how close the labeling is to ground truth, or how well the labeled features in the data are consistent with real-world conditions. This is true whether you’re building computer vision models (e.g., putting bounding boxes around objects on street scenes) or natural language processing (NLP) models (e.g., classifying text for social sentiment).
  2. Quality in data labeling is about accuracy across the overall dataset. Does the work of all of your labelers look the same? Is labeling consistently accurate across your datasets? This is relevant whether you have 29, 89, or 999 data labelers working at the same time.
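
One practical way to put a number on that consistency is inter-annotator agreement. Here is a minimal sketch using Cohen’s kappa from scikit-learn, comparing two annotators’ intent labels on the same five utterances (the labels are made up):

```python
# Measuring labeling consistency between two annotators with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Intent labels assigned to the same five utterances by two labelers.
annotator_a = ["greeting", "find_storage_unit", "negative", "greeting", "affirmative"]
annotator_b = ["greeting", "find_storage_unit", "negative", "affirmative", "affirmative"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level
```

Scores near 1 mean your labelers see the data the same way; low scores are a signal to tighten the labeling guidelines or retrain the team.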

Want to learn more about Natural Language Processing and whether your team is better suited to build or to partner with a platform already in the space?

Natural Language Processing: A Guide To Making The Build Vs. Buy Decision

Check out our Guide to making the build v. buy decision for your next NLP project.

