Natural Language Processing (NLP) is a subfield of Artificial Intelligence concerned with enabling computers to understand human language. In Machine Learning, we can process text data to extract insights for tasks such as text classification, text clustering, named entity recognition, text translation, text generation, and image captioning. When we work with Deep Learning models for text processing, we need a lot of data, depending on the problem we want to solve.

Sometimes we don't have enough data to train a Deep Learning model for a specific task. To solve this problem, data augmentation offers different methods for increasing the amount of training data. There are both traditional and Machine Learning based techniques for expanding a training set by changing the shape of the data, adding noise, or creating nearly similar examples from the current dataset. When working specifically with Natural Language Processing, we can likewise grow a dataset by performing the steps discussed below.
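
As a rough illustration of what such a traditional technique looks like (this is a hand-rolled sketch for intuition, not part of any library), randomly swapping adjacent words produces a near-duplicate of a sentence with the same vocabulary:

```python
import random

def random_swap(sentence: str, n_swaps: int = 1, seed: int = 42) -> str:
    """Create a near-duplicate of a sentence by swapping random adjacent words."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        # Pick a position and swap the word with its right neighbour.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(random_swap("The quick brown fox jumps over the lazy dog"))
```

Such naive edits preserve the word set but can damage grammar, which is why the model-based methods below are usually preferable.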

There are different libraries available for text data augmentation. Here we will use NLPAug, an open-source Python package for data augmentation that offers a variety of methods and pretrained Deep Learning models. There is very good documentation in the NLPAug GitHub repository; we will use some of its methods to create new examples from our data.

https://github.com/makcedward/nlpaug

Installation

NLPAug is available on PyPI and can be easily installed with pip from cmd/terminal. It also requires some other packages to work, so we need to install those as well.

pip install numpy requests nlpaug

If you want the latest beta features, you can install from source, or you can install with conda instead.

# From Github
pip install numpy git+https://github.com/makcedward/nlpaug.git

# from conda
conda install -c makcedward nlpaug

You may need to install additional packages depending on which augmentation methods you use. Some augmentation functions rely on BERT models, so we need the following packages if we want to use those features.

pip install "torch>=1.6.0" "transformers>=4.0.0" sentencepiece

# For Antonym and Synonym augmentations
pip install nltk

For more details on installation, visit https://github.com/makcedward/nlpaug#installation

Download Models

We will be using different pretrained models and phrase databases for data augmentation, and we can use the NLPAug package itself to download these models.

from nlpaug.util.file.download import DownloadUtil

# download word2vec model
DownloadUtil.download_word2vec(dest_dir='.')

# download GloVe model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')

# download fasttext model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')

We will also be using PPDB (The Paraphrase Database) for synonym augmentation of data. The PPDB data can be downloaded from its website.

http://paraphrase.org/#/download

Augment Data

Once installation is complete and the required models are downloaded, we can use different augmentation methods to generate text data. First, we need some example sentences to feed into the models, which will generate slightly different sentences with the same meaning. Let's create a list of sentences; you can use your own dataset as input.

sentences = [
    "This query has taken my invaluable time in the morning",
    "The quick brown fox jumps over the lazy dog",
    "He taught us how to catch errors and how not to write",
    "He walked in crumbling tennis shoes and matched awkwardly like people used to in the seventies",
    "The colors used in the comforter are loud and bright"
]

We import the required libraries that we will use for the rest of the tutorial.

import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action
from tqdm import tqdm

Now we apply each model one by one and check its output.

Contextual Word Embeddings Augmenter

Contextual word embeddings assign each word a representation based on its context. We will use the insert and substitute actions for this.

Insert

aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action="insert")
for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: so this query has certainly taken my invaluable time in till the morning
2: the quick acting brown fox jumps slightly over the lazy dog
3: he taught us right how to catch errors and how not afraid to erroneously write
4: he walked in on crumbling tennis shoes easily and matched awkwardly patterns like people used to have in the seventies
5: note the body colors often used in the comforter are loud and bright

As we can see in the output, this method added some new words to each sentence based on its context.

Substitute

aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action="substitute")
for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: this query was taken your longest time in the morning
2: the crazy brown fox jumps over the other dog
3: he taught us how about catch errors to how what to write
4: he walked in the tennis shoes and smiled awkwardly like people used to in movie week
5: the colors depicted in this cover are loud and bright

With substitute, the length of each sentence stays the same, but some words are replaced.

Synonym Augmenter

We can also apply a synonym augmenter using a phrase database or WordNet. You will need to download PPDB (The Paraphrase Database) first.

Substitute word by PPDB's synonym

PPDB can be downloaded from this URL: http://paraphrase.org/#/download

Download and extract the database, then change the path in the code below to your extracted file's path.

aug = naw.SynonymAug(aug_src='ppdb', model_path="ppdb-2.0-tldr/ppdb-2.0-tldr") # Change Path to your directory
for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: This query defends taken my useful time requests the morning
2: The timely brown fox jumps over the lazy lapdog
3: He helped us how to fish errors and how n't to write
4: He fucked in crumbling tennis purchases and matched awkwardly similar people government to in the seventies
5: The colors used in the comforter threaten noisy and radiant

Substitute word by WordNet's synonym

This requires NLTK (Natural Language Toolkit) to be installed, and it will download some of NLTK's language resources if they are not already present.

aug = naw.SynonymAug(aug_src='wordnet')
for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: This query has taken my invaluable clock time in the first light
2: The quick brown fox jumps over the indolent hot dog
3: He instruct us how to get errors and how not to write
4: Helium walked in crumple tennis shoes and matched awkwardly like multitude used to in the seventies
5: The colors used in the quilt cost flashy and bright

Word Embeddings Augmenter

We can also use a word embeddings augmenter with the Google News word2vec model. The required model can be downloaded by following the steps described in the download section above.

Insert

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
    action="insert")

for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: This query has taken my Italy invaluable rally time in caustic the morning
2: The quick brown fox & jumps over the lazy ist dog
3: Judge He taught Islay us how to SCORES catch errors and how not to write
4: He de walked in crumbling President tennis shoes and de matched awkwardly like people used LEGACY to in the seventies
5: The colors Mary used in spokesman the comforter are loud emptiest and bright

Substitute

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
    action="substitute")
for i, text in enumerate(sentences):
    augmented_text = aug.augment(text)
    print(f"{i + 1}:", augmented_text)
1: This query decade transfered Aunt_Ebb invaluable time in the morning
2: The quick brown crow jumps for the lazy dog
3: He taught us how to hoop_netters errors and how sir_Mendell to penned
4: He walked iin buckling tennis shoes and matched awkwardly kind've people used to in the sixties
5: The colors commonly_referred in time comforter are defeaning and bright

We can view the output for both insert and substitute using the Google News vectors. For some sentences we get very good output that is usable as input to deep learning models.

Conclusion

We have used different models provided by NLPAug, and there are many other models available for text data augmentation that you can use if they fit your requirements. For more details, visit the NLPAug GitHub repository or view the example notebooks on its GitHub page.