A simple revisit of the Transformer. We share simple ideas.
The data is available at https://wit3.fbk.eu/archive/2014-01/texts. We follow [edunov2018classical] to split the data into training/validation/test sets, which contain $153k$/$7k$/$7k$ sentence pairs respectively. All words are lowercased.
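For concreteness, the lowercasing step might look like the sketch below. The file names are hypothetical, and the actual split procedure of [edunov2018classical] is not reproduced here.

```python
# Sketch of the preprocessing step only (file names are hypothetical; the actual
# train/valid/test split follows [edunov2018classical] and is not shown here).
def load_pairs(src_path, tgt_path):
    """Read a parallel corpus and lowercase both sides."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return [(s.strip().lower(), t.strip().lower()) for s, t in zip(fs, ft)]

splits = {name: load_pairs("%s.de" % name, "%s.en" % name)
          for name in ("train", "valid", "test")}  # ~153k / 7k / 7k sentence pairs
```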
We use the TensorFlow-based Transformer (tensor2tensor version 1.2.9) for all experiments. We try the following settings:
All experiments are conducted on a single M40 GPU. The batch size is $6000$ tokens per GPU. Each v2/v1 setting is independently run four/two times.
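Note that the batch size is measured in tokens rather than sentences: sentence pairs are accumulated until the token budget is reached. A simplified sketch of such token-level batching (ignoring tensor2tensor's actual length bucketing and padding accounting) is:

```python
def batch_by_tokens(pairs, max_tokens=6000):
    """Group (source, target) token lists into batches capped at max_tokens,
    counting the longer side of each pair; padding overhead is ignored here."""
    batches, current, current_tokens = [], [], 0
    for src, tgt in pairs:
        cost = max(len(src), len(tgt))
        if current and current_tokens + cost > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((src, tgt))
        current_tokens += cost
    if current:
        batches.append(current)
    return batches
```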
We use beam search with beam width $5$ and length penalty $1.0$ to generate candidates. The mean and standard deviation of each setting are reported.
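For reference, the length penalty divides a hypothesis' log-probability by a length-dependent factor when ranking candidates; a common formulation (e.g. in GNMT) is $lp(Y)=\left(\frac{5+|Y|}{6}\right)^{\alpha}$ with $\alpha=1.0$. A toy sketch of beam search with this penalty is given below; the `log_prob_fn` interface is hypothetical and the toolkit's exact implementation may differ.

```python
def length_penalty(length, alpha=1.0):
    # GNMT-style length penalty; alpha = 1.0 matches the setting above.
    return ((5.0 + length) / 6.0) ** alpha

def beam_search(log_prob_fn, bos, eos, beam_width=5, max_len=50, alpha=1.0):
    """log_prob_fn(prefix) -> {token: log-probability of the next token} (hypothetical)."""
    beams = [([bos], 0.0)]          # (token sequence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:      # hypothesis already complete
                finished.append((seq, score))
                continue
            for tok, lp in log_prob_fn(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:          # every active beam has ended with eos
            beams = []
            break
        # keep the best beam_width partial hypotheses by raw log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)          # hypotheses still open when max_len is hit
    # rank by length-normalized score
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0]), alpha))
```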
Therefore, we recommend using transformer_small + v2 with an $8$- or $10$-layer network.
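With tensor2tensor, the recommended depth could be obtained by overriding the transformer_small hyper-parameter set. The sketch below assumes that this tensor2tensor version exposes transformer_small() in tensor2tensor.models.transformer and that num_hidden_layers controls the number of layers; it does not reproduce the v1/v2 settings themselves.

```python
# Sketch only: assumes tensor2tensor 1.2.9 provides transformer_small() and the
# num_hidden_layers hyper-parameter; the v1/v2 variants are not configured here.
from tensor2tensor.models import transformer

hparams = transformer.transformer_small()
hparams.num_hidden_layers = 8   # the recommended 8- (or 10-) layer network
print(hparams.num_hidden_layers, hparams.hidden_size)
```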
The BLEU scores reported in existing work are summarized in "BLEU scores in existing works".
Most existing works focus on the test performance of NMT. On this page, we show the training performance on IWSLT. For ease of reference, we use $L$ to denote the number of layers.
In the above figure, the legends are formatted as ({v1, v2}, number of layers, dropout).
We have the following observations:
Similar conclusions hold for the curves plotted w.r.t. training iterations.
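As a minimal sketch of how such curves can be reproduced from training logs (the log format and the legend labels below are hypothetical):

```python
# Hypothetical plotting sketch: assumes each log file contains "step<TAB>BLEU" lines;
# the legend labels follow the ({v1, v2}, L, dropout) convention described above.
import matplotlib.pyplot as plt

def load_curve(path):
    steps, bleus = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            step, bleu = line.split()
            steps.append(int(step))
            bleus.append(float(bleu))
    return steps, bleus

for path, label in [("v1_L6.log", "(v1, L=6, dropout=0.1)"),   # hypothetical files/labels
                    ("v2_L8.log", "(v2, L=8, dropout=0.1)")]:
    steps, bleus = load_curve(path)
    plt.plot(steps, bleus, label=label)

plt.xlabel("training iterations")
plt.ylabel("validation BLEU")
plt.legend()
plt.savefig("training_curves.png")
```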