Cleaning Semantic Noise in the E2E dataset

An updated release of the E2E NLG Challenge data with cleaned meaning representations (MRs) and scripts, accompanying the following paper:

Ondřej Dušek, David M. Howcroft, and Verena Rieser (2019): Semantic Noise Matters for Neural Natural Language Generation. In INLG, Tokyo, Japan.

Cleaned data

The fully cleaned E2E NLG Challenge data can be found in cleaned-data. The training and development sets are filtered so that they don't overlap the test set, hence the no-ol naming.

The partially cleaned data (see paper) are under partially-cleaned-data. Do not use these unless you have a good reason to do so.

Cleaning process

This section only documents how we obtained the cleaned data; you do not need to run it.

1.) Re-annotate MRs in the data (use -t if you want a partial fix only):

./slot_error.py -f train-fixed.csv path/to/trainset.csv
./slot_error.py -f devel-fixed.csv path/to/devset.csv
./slot_error.py -f test-fixed.csv path/to/testset_w_refs.csv

2.) Remove instances with overlapping MRs (after re-annotation). The test set is kept intact; if an instance overlaps between the training and development sets, it is removed from the training set:

./remove_overlaps.py train-fixed.csv devel-fixed.csv test-fixed.csv
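
The overlap-removal rule in step 2 can be illustrated with a minimal sketch. This is not the code of remove_overlaps.py: it assumes that each instance is an (MR, text) pair and that two instances overlap when their MR strings are identical after stripping whitespace. The function name `remove_overlaps` and the sample data are hypothetical.

```python
def remove_overlaps(train, devel, test):
    """Filter train/devel so that no MR is shared across the three sets.

    Each set is a list of (mr, text) pairs. The test set is never modified.
    """
    test_mrs = {mr.strip() for mr, _ in test}
    # Drop development instances whose MR also occurs in the test set.
    devel_out = [(mr, txt) for mr, txt in devel if mr.strip() not in test_mrs]
    devel_mrs = {mr.strip() for mr, _ in devel_out}
    # Drop training instances whose MR occurs in the test set or in the
    # (already filtered) development set.
    blocked = test_mrs | devel_mrs
    train_out = [(mr, txt) for mr, txt in train if mr.strip() not in blocked]
    return train_out, devel_out


# Hypothetical toy data, not taken from the E2E dataset.
train = [("name[A], food[Thai]", "A serves Thai."),
         ("name[B], area[riverside]", "B is by the river."),
         ("name[C], eatType[pub]", "C is a pub.")]
devel = [("name[B], area[riverside]", "B is on the riverside."),
         ("name[D], eatType[pub]", "D is a pub.")]
test = [("name[A], food[Thai]", "A offers Thai food."),
        ("name[D], eatType[pub]", "There is a pub called D.")]

train_out, devel_out = remove_overlaps(train, devel, test)
print(len(train_out), len(devel_out))
```

In this toy example only one training instance and one development instance survive, and the test set is untouched. Filtering the development set first means a training instance is only compared against development MRs that are actually kept.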

Experiments with TGen

We used the data with default TGen settings for the E2E Challenge, with validation on the development set (additional training parameter -v input/devel-das.txt,input/devel-text.txt) and evaluation on the test set (both original and cleaned).

To get the plain seq2seq configuration ("TGen-"), we set the classif_filter parameter in the config/config.yaml file to null. To use the slot error script as reranker ("TGen+"), we set classif_filter in the following way:

    classif_filter: {'model': 'e2e_patterns'}

Note that a version of the slot_error.py script is included in TGen code for simpler usage.
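
The general idea of using a slot error check as a reranker can be sketched as follows. This is a toy illustration, not TGen's implementation and not the patterns from slot_error.py: it assumes the E2E MR format `slot[value], ...`, treats a slot as realized only if its value appears verbatim in the text (case-insensitive), and prefers the candidate with the fewest missing slots, breaking ties by model score. All function names here are hypothetical.

```python
import re


def parse_mr(mr):
    """Parse an E2E-style MR string such as 'name[X], food[Y]' into a dict."""
    pairs = re.findall(r"\s*([^\[\],]+?)\[([^\]]*)\]", mr)
    return {slot.strip(): value for slot, value in pairs}


def count_missing(mr, text):
    """Count slots whose value does not occur verbatim in the text.

    Naive substring matching (an assumption of this sketch); real pattern
    matching needs per-slot rules, e.g. for boolean slots.
    """
    lowered = text.lower()
    return sum(1 for value in parse_mr(mr).values()
               if value.lower() not in lowered)


def rerank(mr, candidates):
    """Sort (text, score) candidates by fewest missing slots, then best score.

    A higher score is assumed to be better.
    """
    return sorted(candidates, key=lambda c: (count_missing(mr, c[0]), -c[1]))


# Hypothetical MR and candidate outputs with made-up scores.
mr = "name[The Eagle], food[French], area[riverside]"
candidates = [
    ("The Eagle serves French food.", -1.2),
    ("The Eagle serves French food in the riverside area.", -1.5),
]
best_text, best_score = rerank(mr, candidates)[0]
print(best_text)
```

Here the second candidate wins despite its lower score, because the first one omits the area slot.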

System outputs

You can find the outputs of all versions of TGen trained and tested on the original and cleaned data under system-outputs. These system outputs were used to obtain the top halves of Tables 2 and 3 in the INLG paper.

There are 4 different systems included:

SC-LSTM (Wen et al., 2015)
TGen-minus – TGen without any reranker
TGen-std – TGen with the standard LSTM reranker trained on the same training data
TGen-plus – TGen with the rule-based pattern matching reranker used to clean the data (“oracle”)

All systems were run 5 times with different random network initialization (run0-run4).
