A simple revisit of the Transformer. We share simple ideas.
The data is available at https://wit3.fbk.eu/archive/2014-01/texts. We follow [edunov2018classical] to split the data into training/validation/test sets, which contain $153k$/$7k$/$7k$ sentence pairs respectively. All words are lowercased.
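For concreteness, the lowercasing step might look like the sketch below. The file names are hypothetical, and the actual split procedure of [edunov2018classical] is not reproduced here.

```python
# Sketch of the preprocessing step only (file names are hypothetical; the actual
# train/valid/test split follows [edunov2018classical] and is not shown here).
def load_pairs(src_path, tgt_path):
    """Read a parallel corpus and lowercase both sides."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return [(s.strip().lower(), t.strip().lower()) for s, t in zip(fs, ft)]

splits = {name: load_pairs("%s.de" % name, "%s.en" % name)
          for name in ("train", "valid", "test")}  # ~153k / 7k / 7k sentence pairs
```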
We use the TensorFlow-based Transformer (tensor2tensor version 1.2.9) for all experiments. We try the following settings:
All experiments are conducted on a single M40 GPU. The batch size is $6000$ tokens per GPU. Each v2/v1 setting is independently run four/two times.
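Note that the batch size is measured in tokens rather than sentences: sentence pairs are accumulated until the token budget is reached. A simplified sketch of such token-level batching (ignoring tensor2tensor's actual length bucketing and padding accounting) is:

```python
def batch_by_tokens(pairs, max_tokens=6000):
    """Group (source, target) token lists into batches capped at max_tokens,
    counting the longer side of each pair; padding overhead is ignored here."""
    batches, current, current_tokens = [], [], 0
    for src, tgt in pairs:
        cost = max(len(src), len(tgt))
        if current and current_tokens + cost > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((src, tgt))
        current_tokens += cost
    if current:
        batches.append(current)
    return batches
```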
We use beam search with beam width $5$ and length penalty $1.0$ to generate candidates. The mean and standard deviation of each setting are reported.
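For reference, the length penalty divides a hypothesis' log-probability by a length-dependent factor when ranking candidates; a common formulation (e.g. in GNMT) is $lp(Y)=\left(\frac{5+|Y|}{6}\right)^{\alpha}$ with $\alpha=1.0$. A toy sketch of beam search with this penalty is given below; the `log_prob_fn` interface is hypothetical and the toolkit's exact implementation may differ.

```python
def length_penalty(length, alpha=1.0):
    # GNMT-style length penalty; alpha = 1.0 matches the setting above.
    return ((5.0 + length) / 6.0) ** alpha

def beam_search(log_prob_fn, bos, eos, beam_width=5, max_len=50, alpha=1.0):
    """log_prob_fn(prefix) -> {token: log-probability of the next token} (hypothetical)."""
    beams = [([bos], 0.0)]          # (token sequence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:      # hypothesis already complete
                finished.append((seq, score))
                continue
            for tok, lp in log_prob_fn(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:          # every active beam has ended with eos
            beams = []
            break
        # keep the best beam_width partial hypotheses by raw log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)          # hypotheses still open when max_len is hit
    # rank by length-normalized score
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0]), alpha))
```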
Therefore, we recommend using transformer_small + v2 with an $8$- or $10$-layer network.
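With tensor2tensor, the recommended depth could be obtained by overriding the transformer_small hyper-parameter set. The sketch below assumes that this tensor2tensor version exposes transformer_small() in tensor2tensor.models.transformer and that num_hidden_layers controls the number of layers; it does not reproduce the v1/v2 settings themselves.

```python
# Sketch only: assumes tensor2tensor 1.2.9 provides transformer_small() and the
# num_hidden_layers hyper-parameter; the v1/v2 variants are not configured here.
from tensor2tensor.models import transformer

hparams = transformer.transformer_small()
hparams.num_hidden_layers = 8   # the recommended 8- (or 10-) layer network
print(hparams.num_hidden_layers, hparams.hidden_size)
```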
The BLEU scores reported in existing work are summarized in "BLEU scores in existing works".
Most existing works focus on the test performance of NMT. On this page, we show the training performance on IWSLT. For ease of reference, we use $L$ to denote the number of layers.
In the above figure, the legends are formatted as ({v1, v2}, number of layers, dropout).
We have the following observations:
Similar conclusions hold for the curves plotted w.r.t. training iterations.
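As a minimal sketch of how such curves can be reproduced from training logs (the log format and the legend labels below are hypothetical):

```python
# Hypothetical plotting sketch: assumes each log file contains "step<TAB>BLEU" lines;
# the legend labels follow the ({v1, v2}, L, dropout) convention described above.
import matplotlib.pyplot as plt

def load_curve(path):
    steps, bleus = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            step, bleu = line.split()
            steps.append(int(step))
            bleus.append(float(bleu))
    return steps, bleus

for path, label in [("v1_L6.log", "(v1, L=6, dropout=0.1)"),   # hypothetical files/labels
                    ("v2_L8.log", "(v2, L=8, dropout=0.1)")]:
    steps, bleus = load_curve(path)
    plt.plot(steps, bleus, label=label)

plt.xlabel("training iterations")
plt.ylabel("validation BLEU")
plt.legend()
plt.savefig("training_curves.png")
```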