# Data Augmentation

1. Introduction
2. Getting Started
   1. Install Dependencies
   2. Install Intel Extension for Transformers
3. Data Augmentation
   1. Script
   2. Parameters of Data Augmentation
   3. Supported Augmenters
   4. Text Generation Augmenter
   5. Augmenter Arguments

## Introduction

Data Augmentation is a tool for augmenting NLP datasets in machine learning projects. It integrates [nlpaug](https://github.com/makcedward/nlpaug) and other methods from Intel Lab.

## Getting Started

### Install Dependencies

```bash
pip install nlpaug
pip install transformers
```

### Install Intel Extension for Transformers

```bash
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
cd itrex
pip install -r requirements.txt
pip install -v .
```

## Data Augmentation

### Script

Please refer to the example below.

```python
import os

from datasets import load_dataset

from intel_extension_for_transformers.utils.data_augmentation import DataAugmentation

result_path = "test_path"  # directory for the augmented output file

# Configure the text generation augmenter and run augmentation.
aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()

# The augmented file is tab-separated; load it back with the datasets package.
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
```

### Parameters of Data Augmentation

| Parameter | Type | Description | Default value |
|:----------|:-----|:------------|:--------------|
| augmenter_type | String | Augmentation type | NA |
| input_dataset | String | Dataset name, or a csv or json file | None |
| output_path | String | Path and file name for the augmented data file | "save_path/augmented_dataset.csv" |
| data_config_or_task_name | String | Task name of a GLUE dataset, or a data configuration name | None |
| augmenter_arguments | Dict | Parameters for the augmenter; different augmenters take different parameters | None |
| column_names | String | Column on which augmentation is conducted, used for datasets from the `datasets` package | "sentence" |
| split | String | Split on which augmentation is conducted, e.g. 'validation', 'train' | "validation" |
| num_samples | Integer | Number of augmentation samples to generate | 1 |
| device | String | Deployment device, "cuda" or "cpu" | "cpu" |
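
As an illustration of how these parameters combine, here is a minimal sketch that augments a GLUE task loaded through the `datasets` package. It only uses attributes documented in the table above; the task, column, counts, and file name are illustrative choices:

```python
from intel_extension_for_transformers.utils.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "glue"               # dataset name from the datasets package
aug.data_config_or_task_name = "sst2"    # GLUE task name
aug.column_names = "sentence"            # column to augment (the default)
aug.split = "validation"                 # split to augment (the default)
aug.num_samples = 2                      # number of augmentation samples to generate
aug.device = "cpu"                       # "cuda" or "cpu"
aug.output_path = "augmented_dataset.csv"
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
```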

### Supported Augmenters

| augmenter_type | augmenter_arguments | Default value |
|:---------------|:--------------------|:--------------|
| "TextGenerationAug" | Refer to the "Text Generation Augmenter" section in this document | NA |
| "KeyboardAug" | Refer to "KeyboardAug" in the nlpaug documentation | NA |
| "OcrAug" | Refer to "OcrAug" in the nlpaug documentation | NA |
| "SpellingAug" | Refer to "SpellingAug" in the nlpaug documentation | NA |
| "ContextualWordEmbsForSentenceAug" | Refer to "ContextualWordEmbsForSentenceAug" in the nlpaug documentation | NA |

### Text Generation Augmenter

The text generation augmenter provides a recipe for running a data augmentation algorithm based on conditional text generation with auto-regressive transformer models (such as GPT, GPT-2, Transformer-XL, XLNet, CTRL), in order to automatically generate labeled data. Our approach follows the algorithms described in *Not Enough Data? Deep Learning to the Rescue!* and *Natural Language Generation for Effective Knowledge Distillation*.

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both a label and a sentence.

  - Prepare the datasets:

    ```python
    from datasets import load_dataset

    from intel_extension_for_transformers.utils.utils import EOS

    # Write each GLUE/SST-2 sample as a "<label>\t<sentence><EOS>" line.
    for split in ['train', 'validation']:
        dataset = load_dataset('glue', 'sst2', split=split)
        with open('SST-2/' + split + '.txt', 'w') as fw:
            for d in dataset:
                fw.write(str(d['label']) + '\t' + d['sentence'] + EOS + '\n')
    ```
  - Fine-tune a causal language model:

    You can use the script run_clm.py (transformers/examples/pytorch/language-modeling/run_clm.py) from the transformers examples to fine-tune GPT-2 (gpt2-medium) on the SST-2 task. The loss is the causal language modeling loss.

    ```bash
    DATASET=SST-2
    TRAIN_FILE=$DATASET/train.txt
    VALIDATION_FILE=$DATASET/validation.txt
    MODEL=gpt2-medium
    MODEL_DIR=model/$MODEL-$DATASET

    python3 transformers/examples/pytorch/language-modeling/run_clm.py \
        --model_name_or_path $MODEL \
        --train_file $TRAIN_FILE \
        --validation_file $VALIDATION_FILE \
        --do_train \
        --do_eval \
        --output_dir $MODEL_DIR \
        --overwrite_output_dir
    ```
- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with the script below:

  ```python
  from intel_extension_for_transformers.utils.data_augmentation import DataAugmentation

  aug = DataAugmentation(augmenter_type="TextGenerationAug")
  aug.input_dataset = "/your/original/training_set.csv"
  aug.output_path = "/your/augmented/dataset.csv"
  aug.augmenter_arguments = {'model_name_or_path': '/your/fine-tuned/model'}
  aug.data_augment()
  ```
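
The generated file uses the same tab-separated layout as the script example above, so it can be reloaded with the `datasets` package. A minimal sketch, with an illustrative path:

```python
from datasets import load_dataset

# Reload the generated, tab-separated file for downstream training.
augmented = load_dataset("csv",
                         data_files="/your/augmented/dataset.csv",
                         delimiter="\t",
                         split="train")
print(augmented)
```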

This data augmentation algorithm can be used in several scenarios, such as model distillation.

### Augmenter Arguments

| Parameter | Type | Description | Default value |
|:----------|:-----|:------------|:--------------|
| "model_name_or_path" | String | Language modeling model used to generate data | NA |
| "stop_token" | String | Stop token used in the input data file | EOS |
| "num_return_sentences" | Integer | Total number of samples to generate; -1 means match the number of input samples | -1 |
| "temperature" | Float | Sampling temperature for the CLM | 1.0 |
| "k" | Float | Top-k sampling parameter | 0.0 |
| "p" | Float | Top-p (nucleus) sampling parameter | 0.9 |
| "repetition_penalty" | Float | Penalty applied to repeated tokens | 1.0 |