transformer-revisit

A simple revisit of the Transformer. We share some simple ideas and findings here.


A Study of WMT 2014 English→German

Dataset

We use the data provided by WMT 2014. There are 4.5M training sentence pairs. We concatenate newstest2012 and newstest2013 as the validation set (6003 sentence pairs) and use newstest2014 as the test set (3003 sentence pairs).
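For reference, the validation set can be built by simply concatenating the two dev sets. The sketch below does this in Python; the file names are assumptions and depend on how the WMT 2014 data is downloaded and preprocessed.

```python
# Minimal sketch: build the validation set by concatenating newstest2012
# and newstest2013. File names/paths are assumptions, not the exact ones
# used in the experiments.

def concatenate(inputs, output):
    """Append the contents of each input file to a single output file."""
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

for lang in ("en", "de"):
    concatenate(
        [f"newstest2012.{lang}", f"newstest2013.{lang}"],
        f"valid.{lang}",
    )
```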

Model Configuration

We use the PyTorch-based Transformer for all experiments (version 0.50). We use the transformer_big setting, where $d=1024$ and $d_{ff}=4096$. We try the v1 and v2 settings with different numbers of layers. The dropout ratio is fixed at $0.3$ for all settings.
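As a rough illustration of what the transformer_big dimensions mean, here is a minimal sketch using the standard torch.nn modules rather than the codebase used in the experiments; the 16 attention heads are the usual transformer_big value and are an assumption here.

```python
import torch
import torch.nn as nn

# Sketch of the transformer_big dimensions with plain torch.nn modules.
# This only illustrates the hyper-parameters; it is not the training code.
D_MODEL = 1024      # d
D_FF = 4096         # d_ff
N_HEADS = 16        # assumption: standard transformer_big head count
DROPOUT = 0.3       # fixed for all settings
N_LAYERS = 6        # varied between 6 and 12 in the experiments

encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    dim_feedforward=D_FF,
    dropout=DROPOUT,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS)

# Quick shape check: (sequence_length, batch_size, d_model)
x = torch.randn(10, 2, D_MODEL)
print(encoder(x).shape)  # torch.Size([10, 2, 1024])
```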

All experiments are conducted on eight M40 GPUs. The batch size is $4096$ tokens per GPU. We set update-freq to 16 to simulate a 128-GPU environment.
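Setting update-freq to 16 means gradients are accumulated over 16 mini-batches before each optimizer step, so the effective batch is roughly 8 GPUs × 16 × 4096 ≈ 524K tokens. A minimal sketch of this accumulation pattern, with a stand-in model and data, looks like this:

```python
import torch

# Minimal sketch of gradient accumulation ("update-freq = 16").
# `model`, `optimizer`, and `batches` are placeholders; the real training
# loop lives in the Transformer codebase used for the experiments.
UPDATE_FREQ = 16

model = torch.nn.Linear(8, 2)                      # stand-in model
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(64)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    loss = criterion(model(x), y) / UPDATE_FREQ    # scale so the sum matches one big batch
    loss.backward()                                # gradients accumulate across mini-batches
    if step % UPDATE_FREQ == 0:
        optimizer.step()                           # one parameter update per 16 mini-batches
        optimizer.zero_grad()
```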

We surprisingly find that the baseline is extremely strong.

Each v1/v2 setting is run only once due to limited computational resources.

Inference Results

We use beam search with beam width 5 and length penalty 1.0 to generate candidates. The results are shown below:

| Network Architecture | v1 (BLEU) |
|---|---|
| 6 layers | 29.12 |
| 8 layers | 28.75 |
| 10 layers | 28.63 |
| 12 layers | fail |

Still, using the 6-layer network is the best choice.
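For completeness, the sketch below shows a toy beam search with width 5 and length-normalized scoring (the summed log-probability divided by length$^{lenpen}$, which with lenpen $=1.0$ is the average per-token log-probability). The toy scoring model and the exact normalization are assumptions for illustration, not the implementation used to produce the numbers above.

```python
import math
from heapq import nlargest

BEAM_WIDTH = 5
LENPEN = 1.0
MAX_LEN = 20
EOS = 0
VOCAB = range(1, 11)  # toy vocabulary of 10 tokens plus EOS

def log_probs(prefix):
    """Toy stand-in for a trained model: a fixed log-distribution over tokens."""
    scores = {tok: -float(tok) for tok in VOCAB}
    scores[EOS] = -2.0 + 0.1 * len(prefix)   # EOS becomes relatively likelier over time
    total = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - total for tok, s in scores.items()}

def length_penalty(length, lenpen=LENPEN):
    # Assumed normalization: divide the summed log-probability by length**lenpen.
    return length ** lenpen

def beam_search():
    beams = [([], 0.0)]          # (token sequence, summed log-probability)
    finished = []
    for _ in range(MAX_LEN):
        candidates = []
        for seq, score in beams:
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep the best BEAM_WIDTH expansions; move ended hypotheses to `finished`.
        beams = []
        for seq, score in nlargest(BEAM_WIDTH, candidates, key=lambda c: c[1]):
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    # Rank finished hypotheses by length-normalized score.
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0])))

print(beam_search())
```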

Training performance on WMT 2014 English$\to$German