| Model | Architecture | Layers | Hidden size | Technique | BLEU | Code |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer + 31M monolingual data [bib] | Transformer | 6 | 1024 | Back-translation of 31M monolingual data (sketched below) | 45.6 | github |
| Multi-Agent Dual Learning [bib] | Transformer | 6 | 1024 | 8M monolingual data + multi-agent training | 43.87 | No |
| Transformer + large batch sizes [bib] | Transformer | 6 | 1024 | Large batch sizes; PyTorch implementation | 43 | github |
| DTMT [bib] | BiGRU | 5 + 10 | 1024 | 1 L-GRU followed by several T-GRUs | 42.02 | No |
| Transformer [bib] | Transformer | 6 | 1024 | TensorFlow implementation | 41.8 | github |
| Deliberation network [bib] | BiLSTM | 4 | 1024 | Second-pass decoder deliberates over the first-pass decoder's output + 8M monolingual data | 41.5 | No |
| The Evolved Transformer [bib] | Evolved Transformer | 6 | 1024 | Bilingual data only | 41.3 | No |
| CNN network [bib] | CNN | 15 | --- | --- | 40.51 | No |
| Dual transfer learning [bib] | BiLSTM | 4 | 1024 | Dual transfer learning + 8M monolingual data | 39.98 | No |
| GNMT [bib] | BiLSTM | 8 | 1024 | Fine-tuned with reinforcement learning | 39.92 | No |
| Dual supervised learning [bib] | BiGRU | 1 | 1000 | Probabilistic duality constraint P(x)P(y|x) = P(y)P(x|y) (sketched below) | 34.84 | github |
| Dual unsupervised learning [bib] | BiGRU | 1 | 1000 | Reconstruction duality | 34.83 | No |
| RNNSearch [bib] | BiGRU | 1 | 1000 | First paper to introduce the attention mechanism into NMT | 28.45 | Yes |
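
The back-translation recipe behind the top entry boils down to a three-step pipeline: train a reverse (target-to-source) model on flipped parallel data, use it to translate monolingual target text into synthetic source sentences, and retrain the forward model on the union of real and synthetic pairs. Below is a minimal, library-agnostic Python sketch of that loop; `back_translate`, `toy_train`, and the toy sentence pairs are illustrative stand-ins, not code from the cited paper.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]

def back_translate(
    parallel: List[Tuple[str, str]],   # real (source, target) sentence pairs
    mono_target: List[str],            # target-language monolingual sentences
    train: Callable[[List[Tuple[str, str]]], Model],
) -> Model:
    """Back-translation sketch: reverse model -> synthetic pairs -> forward model."""
    # 1. Train a reverse model on the flipped parallel data (target -> source).
    reverse_model = train([(tgt, src) for src, tgt in parallel])
    # 2. Translate monolingual target sentences into synthetic sources.
    synthetic = [(reverse_model(tgt), tgt) for tgt in mono_target]
    # 3. Train the forward model on real + synthetic pairs.
    return train(parallel + synthetic)

# Toy stand-in for a real NMT trainer, just to make the sketch executable:
# memorizes the training pairs and falls back to copying the input.
def toy_train(pairs: List[Tuple[str, str]]) -> Model:
    table = dict(pairs)
    return lambda s: table.get(s, s)

forward = back_translate(
    parallel=[("hello", "bonjour")],
    mono_target=["merci"],
    train=toy_train,
)
print(forward("hello"))  # -> "bonjour"
```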
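
The probabilistic duality constraint in the dual supervised learning row reads more clearly in log space: taking logs of P(x)P(y|x) = P(y)P(x|y) gives log P(x) + log P(y|x) = log P(y) + log P(x|y), and the squared gap between the two sides can serve as a regularizer added to the losses of both translation directions. Below is a minimal PyTorch sketch under that reading; the function name, the 0.01 weight, and the random tensors standing in for real model scores are assumptions, not the authors' implementation.

```python
import torch

def duality_regularizer(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared violation of log P(x) + log P(y|x) = log P(y) + log P(x|y),
    averaged over the batch. Inputs are per-sentence log-probabilities of
    shape (batch,): the marginals would come from fixed language models,
    the conditionals from the two translation models."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return (gap ** 2).mean()

# Toy usage with random log-probabilities standing in for model scores.
batch = 4
lp_x, lp_y = -torch.rand(batch) * 50, -torch.rand(batch) * 50
lp_y_x, lp_x_y = -torch.rand(batch) * 40, -torch.rand(batch) * 40

lam = 0.01  # assumed trade-off weight; in practice tuned per task
penalty = lam * duality_regularizer(lp_x, lp_y_x, lp_y, lp_x_y)
print(penalty.item())  # added to each direction's MLE loss during training
```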