A simple revisit of the Transformer, sharing simple ideas.
We use the data provided by WMT2014. There are 36M training sentence pairs. We concatenate newstest2012 and newstest2013 as the validation set (6003 sentence pairs) and use newstest2014 as the test set (3003 sentence pairs).
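As a minimal sketch, the validation set can be assembled by concatenating the two newstest files; the file names below are assumptions for illustration, not the actual layout of the WMT2014 release.

```python
import shutil

def concat_files(inputs, output):
    """Concatenate plain-text files, e.g. newstest2012 + newstest2013 -> validation set."""
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as f:
                shutil.copyfileobj(f, out)

# Hypothetical file names for the source/target sides.
for side in ("src", "tgt"):
    concat_files([f"newstest2012.{side}", f"newstest2013.{side}"], f"valid.{side}")
```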
We use the PyTorch-based Transformer implementation for all experiments (version 0.4.0). We use the transformer_big setting, where $d=1024$ and $d_{ff}=4096$. We try the v1 setting with different numbers of layers. The dropout ratio is fixed at 0.1.
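To make the transformer_big dimensions concrete, here is a toy sketch built on torch.nn.Transformer (available in recent PyTorch releases, not in 0.4.x); it is not the implementation used in the experiments, and the 16 attention heads are an assumption based on the usual transformer_big configuration.

```python
import torch.nn as nn

def build_model(num_layers: int) -> nn.Transformer:
    """Stand-in with the transformer_big dimensions: d=1024, d_ff=4096, dropout=0.1.
    nhead=16 is an assumed value (typical for transformer_big), not stated in the text."""
    return nn.Transformer(
        d_model=1024,
        nhead=16,
        num_encoder_layers=num_layers,
        num_decoder_layers=num_layers,
        dim_feedforward=4096,
        dropout=0.1,
    )

# The experiments vary the depth over 6, 8, 10, and 12 layers.
model = build_model(num_layers=6)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count (excludes embeddings)
```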
All experiments are conducted on eight M40 GPUs. The batch size is $4096$ tokens per GPU. We set update-freq to 16 to simulate a 128-GPU environment.
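In effect, update-freq accumulates gradients over 16 forward/backward passes per GPU before each optimizer step, so one update covers 4096 × 8 × 16 = 524,288 tokens, matching 128 GPUs at 4096 tokens each. The loop below is a generic gradient-accumulation sketch under that assumption, not the training loop of the implementation used here.

```python
TOKENS_PER_GPU = 4096
NUM_GPUS = 8
UPDATE_FREQ = 16

# Effective tokens per optimizer update, matching a 128-GPU setup at 4096 tokens/GPU.
assert TOKENS_PER_GPU * NUM_GPUS * UPDATE_FREQ == 4096 * 128  # 524288 tokens

def accumulate_and_step(model, optimizer, batches, loss_fn):
    """Generic gradient-accumulation pattern: sum gradients over UPDATE_FREQ
    mini-batches (scaled so the step equals one large-batch update), then step once."""
    optimizer.zero_grad()
    for src, tgt, labels in batches[:UPDATE_FREQ]:
        loss = loss_fn(model(src, tgt), labels) / UPDATE_FREQ
        loss.backward()
    optimizer.step()
```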
Each v1 setting is run only once due to limited computation resources.
We use beam search with beam width 5 and length penalty 1.0 to generate candidates (a hedged sketch of the length-normalized scoring appears after the results). The BLEU results are shown below:
Network architecture | v1 (BLEU) |
---|---|
6 layers | 43.06 |
8 layers | 42.69 |
10 layers | 42.73 |
12 layers | 42.69 |
Still, the 6-layer network is the best choice.
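As a side note on the decoding setup: with a length penalty of 1.0, finished hypotheses are typically ranked by their cumulative log-probability divided by length**lenpen (a common length-normalization scheme, assumed here; with lenpen = 1.0 this is the average per-token log-probability). A minimal illustrative sketch, not the decoder actually used:

```python
import math

def normalized_score(token_log_probs, lenpen: float = 1.0) -> float:
    """Score a finished beam hypothesis: cumulative log-probability divided by
    length**lenpen. With lenpen=1.0 this is the mean per-token log-probability."""
    return sum(token_log_probs) / (len(token_log_probs) ** lenpen)

# Example: a shorter hypothesis with higher per-token probability wins under lenpen=1.0.
hyp_a = [math.log(0.6)] * 5   # 5 tokens
hyp_b = [math.log(0.5)] * 7   # 7 tokens
best = max([hyp_a, hyp_b], key=normalized_score)  # hyp_a
```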
The BLEU scores reported in existing work are summarized in "BLEU scores in existing works".