TNTSPA (Translating Natural language To SPARQL)

SPARQL is a highly powerful query language for an ever-growing number of Linked Data resources and Knowledge Graphs. Using it requires a certain familiarity with the entities in the domain to be queried as well as expertise in the language's syntax and semantics, neither of which average human web users can be assumed to possess. To overcome this limitation, automatically translating natural language questions to SPARQL queries has been a vibrant field of research. However, to date, the vast success of deep learning methods has not yet been fully propagated to this research problem.

This paper contributes to filling this gap by evaluating eight different Neural Machine Translation (NMT) models on the task of translating from natural language to the structured query language SPARQL. While highlighting the importance of large, high-quality datasets, the results show a clear dominance of a CNN-based architecture, with a BLEU score of up to 98 and accuracy of up to 94%.

Research Paper

Title: Neural Machine Translating from Natural Language to SPARQL

Authors: Dr. Dagmar Gromann, Prof. Sebastian Rudolph and Xiaoyu Yin

The PDF is available at http://arxiv.org/abs/1906.09302

@article{DBLP:journals/corr/abs-1906-09302,
  author    = {Xiaoyu Yin and
               Dagmar Gromann and
               Sebastian Rudolph},
  title     = {Neural Machine Translating from Natural Language to {SPARQL}},
  journal   = {CoRR},
  volume    = {abs/1906.09302},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.09302},
  archivePrefix = {arXiv},
  eprint    = {1906.09302},
  timestamp = {Thu, 27 Jun 2019 18:54:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-09302.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Master Thesis

Title: Translating Natural language To SPARQL

Author: Xiaoyu Yin

Supervisors: Dr. Dagmar Gromann and Dr. Dmitrij Schlesinger

The thesis was completed on 8 January 2019 and has since been turned into the paper linked above.

The thesis can be found in the thesis folder and the defense slides in the presentation folder; both are available in .tex and .pdf versions.

Datasets

Downloads (Google Drive)

Usage

Files ending with .en (e.g. dev.en, train.en, test.en) contain English sentences, and .sparql files contain SPARQL queries. Files sharing the same prefix are mapped 1-1 and were used in training as English-SPARQL pairs. vocab.* or dict.* files are vocabulary files. fairseq has its own special input file requirements, so the aforementioned files were not used by it directly but were processed into the binary format stored in the fairseq-data-bin folder of each dataset.
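For illustration, here is a minimal sketch of how one split can be read as aligned pairs. The helper load_split and the path data/monument are my own hypothetical examples, not part of this repository; the binaries under fairseq-data-bin are the kind of output fairseq's fairseq-preprocess command produces from such text files.

# Minimal sketch (hypothetical helper, not from this repository): read one
# dataset split as aligned English-SPARQL pairs.
from pathlib import Path

def load_split(data_dir, split):
    """Return (English, SPARQL) line pairs for a split such as 'train' or 'dev'."""
    en = Path(data_dir, split + ".en").read_text(encoding="utf-8").splitlines()
    sparql = Path(data_dir, split + ".sparql").read_text(encoding="utf-8").splitlines()
    # The two files are aligned line by line; lengths must match.
    assert len(en) == len(sparql), "splits must be 1-1 aligned"
    return list(zip(en, sparql))

pairs = load_split("data/monument", "train")  # hypothetical dataset path
print(pairs[0])  # (English question, corresponding SPARQL query)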

Sources

The datasets used in this paper were originally downloaded from the Internet; I downloaded them and split them in the way I needed to train the models. The sources are the Monument dataset, LC-QUAD, and DBNQA.

Experimental Setup

Dataset splits and hyperparameters

See the dataset splits and hyperparameter settings in the paper.

Hardware configuration

(Figure: hardware configuration)

Results

Raw data

We kept the inference translations of each model on each dataset; these were used to generate the BLEU scores, accuracy numbers, and corresponding graphs in the sections below. The results are saved as dev_output.txt (validation set) and test_output.txt (test set) files and are available here (compact version).

A full version containing the raw output of the frameworks is also available.
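For illustration, a corpus-level BLEU score over such an output file can be computed roughly as follows. This is a sketch using NLTK under my own assumptions (one whitespace-tokenized query per line, file names following the conventions above); the exact evaluation scripts used for the paper may differ.

# Sketch: corpus BLEU between reference queries (test.sparql) and model
# output (test_output.txt), one query per line in each file.
from nltk.translate.bleu_score import corpus_bleu

with open("test.sparql", encoding="utf-8") as ref_f, \
     open("test_output.txt", encoding="utf-8") as hyp_f:
    references = [[line.split()] for line in ref_f]  # one reference per hypothesis
    hypotheses = [line.split() for line in hyp_f]

print("BLEU: %.1f" % (corpus_bleu(references, hypotheses) * 100))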

Training

Plots of training perplexity for each model and dataset are available in a separate PDF here.

Test results

Table of BLEU scores for all models on the validation and test sets (image: BLEU scores)

Table of accuracy (in %) of syntactically correct generated SPARQL queries (image: F1 score / accuracy)
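As an illustration of this metric, the share of generated queries that parse as valid SPARQL can be approximated with rdflib's query parser. This is a sketch under my own assumptions: the validator actually used for the paper may differ, and the queries must first be decoded from the datasets' encoded form back to standard SPARQL syntax.

# Sketch: percentage of generated queries that rdflib accepts as
# syntactically valid SPARQL. Assumes one decoded query per line.
from rdflib.plugins.sparql import prepareQuery

def syntactic_accuracy(path):
    total = valid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            try:
                prepareQuery(line)  # raises on a SPARQL syntax error
                valid += 1
            except Exception:
                pass  # any parse error counts as invalid
    return 100.0 * valid / max(total, 1)

print("%.1f%% syntactically correct" % syntactic_accuracy("test_output.txt"))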

Please find more results and detailed explanations in the research paper and the thesis.

Trained Models

Because some models were very space-consuming after training (esp. GNMT4, GNMT8) on some specific datasets (esp. DBNQA), I did not download all of the models from the HPC server. This is an overview of the availability of the trained models on my drive:

Model        Monument  Monument80  Monument50  LC-QUAD  DBNQA
NSpM         yes       yes         yes         yes      yes
NSpM+Att1    yes       yes         yes         yes      yes
NSpM+Att2    yes       yes         yes         yes      yes
GNMT4        no        yes         no          no       no
GNMT8        no        no          no          no       no
LSTM_Luong   yes       yes         yes         yes      no
ConvS2S      yes       yes         yes         yes      no
Transformer  yes       yes         yes         yes      no

One More Thing

This paper and thesis could not have been completed without the help of my supervisors (Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger, and Prof. Sebastian Rudolph) and these great open source projects. My sincere appreciation goes to everyone who has been working on this subject, and I hope we will show the world its value in the near future.

By the way, I now work as an Android developer. Although I still have a passion for AI, want to keep learning, and may even pursue a career in it someday, my current focus is on Software Engineering. I enjoy any kind of experience or knowledge sharing and would love to make new friends! Connect with me on LinkedIn.