Machine Translation Evaluation framework

About The Project

The goal of this repository is to provide a useful framework to evaluate and compare different Machine Translation engines against each other on a variety of datasets.

Evaluating Machine Translation quality involves several complexities, such as finding suitable test data, agreeing on a metric, connecting to different translation engines, and so on.

Our own goal was to build a system that allows us to compare several MT engines against each other on a variety of datasets, and to be able to repeat the evaluation at a regular frequency. Surprisingly, we could not find any existing open source solution that fit our requirements, so we created our own and decided to publish it in case somebody finds it useful.

We also provide the current NMT evaluation results that we obtained during our own evaluation.

Usage

First, clone our repository, cd to the root folder, and install all the required libraries:

pip install -r requirements.txt

Evaluation happens in 3 stages:

  • Download and prepare test dataset
  • Translate all datasets with required MT engines
  • Evaluate translations by required metrics

All evaluation scripts are stored in the evaluation folder.

To reproduce our results

# Scripts are called as modules (python -m ...) rather than as plain scripts
# to simplify imports across sibling folders
python -m evaluation.01_download_datasets
python -m evaluation.02_translate_datasets
python -m evaluation.03_evaluate_translations
# Final evaluation results will be in benchmarks folder:
# benchmarks/main_evaluation_set.tsv

# Careful – it took about a day to run the whole evaluation, even on a good GPU machine
# Careful – if you are providing your own keys for cloud services, it will cost you money ($30 for Azure and $50 for Google for this evaluation).

These three commands will download the same datasets that we used and evaluate them with the same engines that we evaluated.

If you want to run your own evaluation

If your dataset can be downloaded from OPUS

  1. Create your own yaml config in configs/dataset_to_download

    The config file should contain the source and target languages and the list of datasets to download from OPUS; a small sanity-check sketch for such a config appears at the end of this subsection.

source_language: en
target_language: es

datasets:
    # Name of the dataset should be one from
    # https://opus.nlpl.eu/opusapi/?corpora=True
  - name: TED2013 
    # Type is your own tag to differentiate between datasets
    type: General
    # test, dev, train are fixed subsets
    # You can skip any of them in the config but cannot add your own
    # We guarantee that there are no data duplicates between these three sets
    test:
      # Required number of lines in the dataset
      # Actual size may be smaller due to removal of duplicates and short lines
      size: 10
      # If true, duplicate lines are removed from the dataset
      no_duplicates: true
      # Minimum length in characters for a string to be included in the dataset
      min_str_length: 30
    dev:
      size: 100
      no_duplicates: true
      min_str_length: 30
    train:
      size: 100
      no_duplicates: true
      min_str_length: 30
  2. Run the evaluation scripts, specifying your config file
# Using quick_evaluation_set.yaml as an example
# configs for translation and evaluation steps will be created automatically
python -m evaluation.01_download_datasets datasets_to_download=quick_evaluation_set
python -m evaluation.02_translate_datasets datasets_to_translate=quick_evaluation_set
python -m evaluation.03_evaluate_translations datasets_to_evaluate=quick_evaluation_set
# Final evaluation results will be in benchmarks folder:
# benchmarks/quick_evaluation_set.tsv
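
Before running these scripts, you can sanity-check that your download config parses as expected. A minimal sketch, assuming PyYAML is available and using a guessed path based on the example above (adjust it to your own config file):

import yaml

# Guessed path based on the config folder and example name above
config_path = 'configs/dataset_to_download/quick_evaluation_set.yaml'
with open(config_path, encoding='utf-8') as f:
    config = yaml.safe_load(f)

print(config['source_language'], '->', config['target_language'])
for dataset in config['datasets']:
    subsets = {name: spec['size'] for name, spec in dataset.items() if name in ('test', 'dev', 'train')}
    print(dataset['name'], dataset['type'], subsets)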

If you have your own dataset

You may want to evaluate translation quality on your own dataset or on public data that is not available from OPUS.

  1. Create two parallel files from your dataset, one with the source data and one with the reference translations (a quick parallel-file check is sketched at the end of this subsection).

  2. Create your own yaml config in configs/dataset_to_translate

    The config file should contain the source and target languages and the list of datasets to translate.

source_language: en
target_language: es

datasets:
  # For your own datasets you may use whatever name you want
- name: PrivateDataset
  # Type is your own tag to differentiate between datasets
  type: General
  # test, dev, train are fixed subsets
  # You can skip any of them in the config but cannot add your own
  test:
    # Path to source and reference file
    # They should be either absolute or relative to the project root folder
    source: NMT_datasets/en.es/PrivateDataset/test.en
    target: NMT_datasets/en.es/PrivateDataset/test.es
    # Number of lines in source and reference files
    size: 9
  dev:
    source: NMT_datasets/en.es/PrivateDataset/dev.en
    target: NMT_datasets/en.es/PrivateDataset/dev.es
    size: 87
  train:
    source: NMT_datasets/en.es/PrivateDataset/train.en
    target: NMT_datasets/en.es/PrivateDataset/train.es
    size: 90
  3. Run the evaluation pipeline starting from the second script, specifying your config file
# Using my_private_dataset as an example
# config for evaluation step will be created automatically
python -m evaluation.02_translate_datasets datasets_to_translate=my_private_dataset
python -m evaluation.03_evaluate_translations datasets_to_evaluate=my_private_dataset
# Final evaluation results will be in benchmarks folder:
# benchmarks/my_private_dataset.tsv
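
As noted above, the source and reference files must be parallel, line by line. A quick standalone check (not part of the framework; the paths are placeholders taken from the example config):

# Placeholder paths taken from the example config above
source_path = 'NMT_datasets/en.es/PrivateDataset/test.en'
target_path = 'NMT_datasets/en.es/PrivateDataset/test.es'

with open(source_path, encoding='utf-8') as src, open(target_path, encoding='utf-8') as tgt:
    source_lines = src.read().splitlines()
    target_lines = tgt.read().splitlines()

assert len(source_lines) == len(target_lines), \
    f'Files are not parallel: {len(source_lines)} vs {len(target_lines)} lines'
print(f'OK: {len(source_lines)} parallel lines (use this number as "size" in the config)')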

If you have your own machine translation engine

If your MT engine is callable from Python code, you can add it to this evaluation framework.

  1. Create a new class for your MT engine in the translator folder.

    translator/translator_empty.py can be used as a reference; a more concrete sketch is shown at the end of this section.

from translator.translator_base import TranslatorBase

# Empty translator to test overall translator architecture
class TranslatorEmpty( TranslatorBase ):
    # required init params:
    # max_batch_lines - how many lines your engine can translate in one go
    # max_batch_chars - how many characters should be in one translation batch
    # max_file_lines - if we process text file,
    # how many lines we can read from file in one go to split to batches
    # (rule of thumb is 10*max_batch_lines)
    # verbose - do we want info output or only errors
    # You can add your own additional parameters here
    def __init__( self,
        max_batch_lines = 4, max_batch_chars = 1000, max_file_lines = 40,
        verbose = False ):

        super(TranslatorEmpty, self).__init__(
            max_batch_lines = max_batch_lines, max_batch_chars = max_batch_chars, max_file_lines = max_file_lines,
            verbose = verbose )

        # Your translator object should be created here

        self.logger.info(f'Created Empty translator engine')

    # Main function to translate lines
    # We guarantee that len(lines) <= max_batch_lines and
    # len( ''.join(lines) ) <= max_batch_chars
    def _translate_lines( self, lines ):
        result = [line for line in lines]
        return result

    # Setting source and target language for next calls of _translate_lines
    # Function is needed if we are using multilingual translation engines
    def _set_language_pair( self, source_language, target_language ):
        # Config your engine for the language pair
        return
  2. Add your engine to the enum in translator/create_translator.py

    You can use any class name for the engine; the value should be the full import path of your translator class.

class Translator(Enum):
    Custom = 'translator.translator_custom.TranslatorCustom'
    ...
  3. Add your engine to the configs/translation_engines/engines.yaml config
  # Class name that you set in the previous step
- class: Custom 
  # Any name you like; it identifies this MT engine
  # with its current settings
  name: Custom.option.42
  # Type is your own tag to differentiate between engines
  type: CustomEngine
  settings:
    # Any settings that you need for the engine
    # they will be passed by name to your engine __init__ function
    option: 42
  4. Run the evaluation scripts as usual

    Alternatively, at step 3 you can create your own engine config file and specify it when calling the translation script:

# Using my_private_dataset and my_private_engine as an example
# config for evaluation step will be created automatically
python -m evaluation.02_translate_datasets datasets_to_translate=my_private_dataset translation_engines=my_private_engine
python -m evaluation.03_evaluate_translations datasets_to_evaluate=my_private_dataset
# Final evaluation results will be in benchmarks folder:
# benchmarks/my_private_dataset.tsv
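
To make step 1 above more concrete, here is a minimal sketch of a custom translator that wraps a pre-trained MarianMT model from the Hugging Face transformers library. This file is hypothetical and not part of the repository: transformers is an assumed extra dependency, and the model names are assumed to follow the Helsinki-NLP/opus-mt-{source}-{target} naming convention.

# translator/translator_marian.py – hypothetical example, not part of the repository
from transformers import MarianMTModel, MarianTokenizer

from translator.translator_base import TranslatorBase

class TranslatorMarian( TranslatorBase ):
    def __init__( self,
        max_batch_lines = 4, max_batch_chars = 1000, max_file_lines = 40,
        verbose = False ):

        super(TranslatorMarian, self).__init__(
            max_batch_lines = max_batch_lines, max_batch_chars = max_batch_chars, max_file_lines = max_file_lines,
            verbose = verbose )

        # Model and tokenizer are loaded lazily in _set_language_pair
        self.model = None
        self.tokenizer = None
        self.logger.info('Created Marian translator engine')

    def _translate_lines( self, lines ):
        # Tokenize the batch, generate translations and decode them back to text
        batch = self.tokenizer(lines, return_tensors='pt', padding=True, truncation=True)
        generated = self.model.generate(**batch)
        return [self.tokenizer.decode(tokens, skip_special_tokens=True) for tokens in generated]

    def _set_language_pair( self, source_language, target_language ):
        # Assumed model naming convention: Helsinki-NLP/opus-mt-{src}-{tgt}
        model_name = f'Helsinki-NLP/opus-mt-{source_language}-{target_language}'
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

Such a class would then be registered in translator/create_translator.py and configs/translation_engines/engines.yaml exactly as described in steps 2 and 3 above.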

Evaluation setup

We evaluated several translation engines for English-to-Spanish translation.

All evaluation results presented here are valid as of 7 July 2022, with the library versions defined in requirements.txt and the datasets and models downloaded on that day.

Please keep in mind that if you try to reproduce our results at a later date, they may differ due to updated models, libraries, and cloud MT engines.

Evaluated MT engines

Cloud engines

Pre-trained open source engines

Multi-lingual engines

Datasets

Our work would not be possible without the OPUS project: https://opus.nlpl.eu/

OPUS provides collections of different public translation datasets, with an API that allows searching for and downloading datasets in one common format.

For our evaluation we used the 7 datasets from OPUS listed below. For each dataset we tried to create a test set of approximately 5,000 lines.

| Dataset | Test set size | Links | Description |
| --- | --- | --- | --- |
| EMEA | 4320 | https://opus.nlpl.eu/EMEA.php, http://www.emea.europa.eu/ | Parallel corpus made out of PDF documents from the European Medicines Agency |
| WikiMatrix | 5761 | https://opus.nlpl.eu/WikiMatrix.php, https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix | Parallel corpora from Wikimedia compiled by Facebook Research |
| TED2020 | 5249 | https://opus.nlpl.eu/TED2020.php, "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation" | A crawl of nearly 4000 TED and TED-X transcripts from July 2020 |
| OpenSubtitles | 3097 | https://opus.nlpl.eu/OpenSubtitles-v2018.php, http://www.opensubtitles.org/ | Collection of translated movie subtitles from opensubtitles.org |
| EUbookshop | 5313 | https://opus.nlpl.eu/EUbookshop.php, "Parallel Data, Tools and Interfaces in OPUS" | Corpus of documents from the EU bookshop |
| ParaCrawl | 5227 | https://opus.nlpl.eu/EMEA.php, http://paracrawl.eu/download.html | Parallel corpora from Web Crawls collected in the ParaCrawl project |
| CCAligned | 5402 | https://opus.nlpl.eu/EMEA.php, "CCAligned: A Massive Collection of Cross-lingual Web-Document Pairs" | Parallel corpora from Commoncrawl Snapshots |

Metrics

While BLEU is considered the most common metric in Machine Translation, there are other options available that may be better tuned for particular use cases.

In our evaluation we implemented several metrics to be able to compare machine translation engines across different dimensions.

In our use case, it turned out that all metrics generally agree with each other (if one engine was better than the other by one metric, it was better by all metrics). For that reason, we use only BLEU to show our final evaluation results.

It is important to note, however, that in other use cases (specifically for fine-grained machine translation quality comparison) other metrics may prove more useful.

| Metric | Implementation | Paper |
| --- | --- | --- |
| BLEU | https://github.com/mjpost/sacrebleu | https://www.aclweb.org/anthology/P02-1040.pdf |
| TER | https://github.com/mjpost/sacrebleu | https://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf |
| CHRF | https://github.com/mjpost/sacrebleu | https://aclanthology.org/W15-3049.pdf |
| ROUGE | https://github.com/pltrdy/rouge | https://aclanthology.org/W04-1013.pdf |
| BERTScore | https://github.com/Tiiiger/bert_score | https://arxiv.org/abs/1904.09675 |
| COMET | https://github.com/Unbabel/COMET | https://aclanthology.org/2020.emnlp-main.213/ |
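
The sacrebleu package linked above implements BLEU, TER, and CHRF and can also be called directly, independently of this framework's evaluation scripts; a minimal sketch with made-up example sentences:

import sacrebleu

# Made-up MT outputs and a single reference stream
hypotheses = ['el gato esta en la alfombra', 'me gusta el cafe']
references = [['el gato esta sobre la alfombra', 'me encanta el cafe']]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f'BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}, TER = {ter.score:.2f}')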

Evaluation results

All results can be found in benchmarks/main_evaluation_set.(tsv|yaml|xlsx)
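
Since the benchmark files are plain tab-separated values, they can be inspected with standard tooling. A minimal sketch using pandas (an assumed extra dependency; the column layout is whatever the evaluation scripts produce):

import pandas as pd

# Load the benchmark table produced by evaluation.03_evaluate_translations
results = pd.read_csv('benchmarks/main_evaluation_set.tsv', sep='\t')
print(results.head())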

General comparison between engines

"General comparison"

Multi-lingual models

"Multi-lingual models"

Beam search vs greedy

"Beam search vs greedy"

Metrics comparison

When we sort all MT engines by one metric (say, BLEU), the ordering by every other metric turns out to be the same.

"Metrics comparison"

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Maintainers

Acknowledgments

This repo wouldn't be possible without the OPUS project: https://opus.nlpl.eu/
