transformer-revisit

A simple revisit of the Transformer. We share some simple ideas.


A Study of IWSLT German→English

Dataset

The data is available at https://wit3.fbk.eu/archive/2014-01/texts. We follow [edunov2018classical] for the training/validation/test split, which contains $153k$/$7k$/$7k$ sentence pairs respectively. All words are lower-cased.
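As a rough illustration of the preprocessing, the minimal sketch below lowercases each split and checks that source and target line counts match. The file names (`train.de`, `train.en`, etc.) are hypothetical placeholders, not the exact layout of this repository.

```python
# A minimal preprocessing sketch (not the exact pipeline of this project):
# lowercase every sentence file and report the size of each split.
# The file names used here are illustrative placeholders.
import io

def lowercase_file(src_path, dst_path):
    """Lowercase a raw text file line by line and return the line count."""
    count = 0
    with io.open(src_path, encoding="utf-8") as fin, \
         io.open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.lower())
            count += 1
    return count

for split in ("train", "valid", "test"):
    n_de = lowercase_file("%s.de" % split, "%s.lc.de" % split)
    n_en = lowercase_file("%s.en" % split, "%s.lc.en" % split)
    assert n_de == n_en, "source/target line counts must match"
    print("%s: %d sentence pairs" % (split, n_de))
```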

Model Configuration

We use the TensorFlow-based Transformer (tensor2tensor version 1.2.9) for all experiments. We try the following settings:

All experiments are conducted on a single M40 GPU. The batch size is $6000$ tokens per GPU. Each v2/v1 setting is independently run four/two times.
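For concreteness, the sketch below shows one way a deeper variant of `transformer_small` could be registered with tensor2tensor's hparams registry. The registered name `transformer_small_l8` and the overridden values (layer count, dropout) are illustrative assumptions, not the exact v1/v2 settings used in these experiments; only the batch size follows the paragraph above.

```python
# A minimal sketch (not the exact settings of these experiments): derive a
# deeper variant of tensor2tensor's built-in transformer_small hparams set.
# The name "transformer_small_l8" and the overridden values are illustrative.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_small_l8():
    hparams = transformer.transformer_small()   # built-in small configuration
    hparams.num_hidden_layers = 8               # L = 8 layers (illustrative)
    hparams.batch_size = 6000                   # tokens per GPU, as stated above
    hparams.layer_prepostprocess_dropout = 0.1  # dropout value is illustrative
    return hparams
```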

Inference results

We use beam search with beam width 5 and length penalty 1.0 to generate candidates. The mean and standard deviation over the runs of each setting are reported.
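To make the scoring rule concrete, here is a small sketch of length-normalized beam ranking, assuming the GNMT-style length penalty $((5 + |Y|)/6)^\alpha$ with $\alpha = 1.0$ matching the penalty above. The candidate hypotheses and log-probabilities are made-up numbers for illustration.

```python
# A minimal sketch of length-normalized beam scoring with a GNMT-style
# length penalty; alpha = 1.0 corresponds to the length penalty stated above.
# The candidates and log-probabilities below are made-up numbers.

def length_penalty(length, alpha=1.0):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rank_candidates(candidates, alpha=1.0):
    """Rank (tokens, log_prob) pairs by length-normalized score, best first."""
    scored = [(log_prob / length_penalty(len(tokens), alpha), tokens)
              for tokens, log_prob in candidates]
    return sorted(scored, reverse=True)

# Example: two hypothetical beam candidates.
beam = [("the cat sits on the mat .".split(), -4.2),
        ("the cat sits .".split(), -3.5)]
for score, tokens in rank_candidates(beam, alpha=1.0):
    print("%.3f  %s" % (score, " ".join(tokens)))
```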

Therefore, we recommend the transformer_small + v2 + $8/10$-layer setting.

The BLEU scores reported in existing work are summarized in "BLEU scores in existing works".

Training performance of IWSLT German→English

Most existing works focus on the test performance of NMT. On this page, we show the training performance on IWSLT. For ease of reference, we use $L$ to denote the number of layers.

Training loss w.r.t training iterations

In the above picture, the legend of each curve is shown as ({v1, v2}, number of layers, dropout).
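The sketch below illustrates this legend convention when plotting training-loss curves. The loss values and step counts are hypothetical placeholders rather than real training logs.

```python
# A minimal plotting sketch illustrating the legend convention above:
# each curve is labeled as (setting, number of layers, dropout).
# The loss values and step counts are hypothetical placeholders.
import matplotlib.pyplot as plt

runs = {
    ("v1", 6, 0.1): [6.5, 4.8, 3.9, 3.4, 3.1],
    ("v2", 8, 0.1): [6.4, 4.5, 3.6, 3.1, 2.8],
}

for (setting, num_layers, dropout), losses in runs.items():
    steps = [1000 * (i + 1) for i in range(len(losses))]
    plt.plot(steps, losses,
             label="(%s, L=%d, dropout=%.1f)" % (setting, num_layers, dropout))

plt.xlabel("training iterations")
plt.ylabel("training loss")
plt.legend()
plt.savefig("training_loss_vs_iterations.png")
```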

We have the following observations:

Training loss w.r.t wall-clock time

The curves w.r.t. wall-clock time lead to conclusions similar to those drawn from the curves w.r.t. training iterations.