transformer-revisit

A simple revisit of the Transformer, sharing simple ideas.


A Study of WMT 2014 English→French

Dataset

We use the data provided by WMT 2014. There are 36M training sentence pairs. We concatenate newstest2012 and newstest2013 as the validation set (6003 sentence pairs) and use newstest2014 as the test set (3003 sentence pairs).
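As a rough illustration, the validation split can be assembled by concatenating the two dev sets. The sketch below assumes plain-text files named `newstest2012.{en,fr}` and `newstest2013.{en,fr}`; these file names are assumptions, not the actual WMT release layout.

```python
# Minimal sketch: build the validation set by concatenating newstest2012 and
# newstest2013 for both language sides (file names are assumptions).
def concat_files(inputs, output):
    """Concatenate the given text files line by line into `output`."""
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

for lang in ("en", "fr"):
    concat_files(
        [f"newstest2012.{lang}", f"newstest2013.{lang}"],
        f"valid.{lang}",
    )
```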

Model Configuration

We use the PyTorch-based Transformer implementation for all experiments (version 0.40). We use the transformer_big setting, where $d=1024$ and $d_{ff}=4096$. We try the v1 setting with different numbers of layers. The dropout ratio is fixed at 0.1.
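For clarity, the sketch below collects these hyper-parameters in one place. The field names are illustrative rather than the toolkit's exact argument names, and the 16 attention heads are the standard transformer_big value, an assumption not stated above.

```python
# A minimal sketch of the transformer_big configuration used in these runs.
from dataclasses import dataclass

@dataclass
class TransformerBigConfig:
    d_model: int = 1024   # model/embedding dimension, d
    d_ff: int = 4096      # feed-forward inner dimension, d_ff
    num_heads: int = 16   # attention heads (standard transformer_big value; assumption)
    num_layers: int = 6   # encoder/decoder depth; varied over {6, 8, 10, 12}
    dropout: float = 0.1  # fixed for all runs

# The v1 experiments sweep only the depth.
configs = [TransformerBigConfig(num_layers=n) for n in (6, 8, 10, 12)]
```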

All experiments are conducted on eight M40 GPUs. The batch size is $4096$ tokens per GPU. We set update-freq to 16 to simulate a 128-GPU environment.
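A quick sanity check of the resulting effective batch size:

```python
# 8 GPUs x 4096 tokens x update-freq 16 behaves like 128 GPUs with 4096 tokens each.
gpus = 8
tokens_per_gpu = 4096
update_freq = 16

tokens_per_update = gpus * tokens_per_gpu * update_freq
simulated_gpus = gpus * update_freq

print(tokens_per_update)  # 524288 tokens per parameter update
print(simulated_gpus)     # 128
```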

Each v1 setting is run only once due to limited computational resources.

Inference Results

We use beam search with beam width 5 and length penalty 1.0 to generate candidates; a sketch of how the length penalty enters the beam score is given after the table. The results are shown below:

| Network Architecture | v1 (BLEU) |
| --- | --- |
| 6 layers | 43.06 |
| 8 layers | 42.69 |
| 10 layers | 42.73 |
| 12 layers | 42.69 |

Still, using the 6-layer network is the best choice.
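As referenced above, here is a minimal sketch of how the length penalty enters candidate ranking during beam search. It assumes a normalization of the form "sum of token log-probabilities divided by length raised to the penalty", so with penalty 1.0 candidates are ranked by their mean log-probability; the exact formula used by the toolkit may differ.

```python
# Minimal sketch of length-penalized scoring for finished beam hypotheses.
import math

def beam_score(token_log_probs, length_penalty=1.0):
    """Sum of token log-probs divided by length ** length_penalty (assumed form)."""
    return sum(token_log_probs) / (len(token_log_probs) ** length_penalty)

# Two toy hypotheses with identical per-token confidence but different lengths:
short = [math.log(0.5)] * 4  # raw sum ~ -2.77
long_ = [math.log(0.5)] * 8  # raw sum ~ -5.55

# Raw sums always favor the shorter hypothesis; with length_penalty=1.0 the
# normalized scores are equal, removing that length bias.
print(sum(short), sum(long_))
print(beam_score(short), beam_score(long_))
```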

The BLEU scores reported in existing work are summarized in "BLEU scores in existing works".

Training performance of WMT2014 English$\to$French