WMT 2014 English→French

| Paper | Basic Architecture | #Layers | Hidden Size | Algorithm | BLEU | Open-Sourced |
|---|---|---|---|---|---|---|
| Transformer + 31M monolingual data [bib] | Transformer | 6 | 1024 | Back-translation of 31M monolingual sentences (sketch below) | 45.6 | GitHub |
| Multi-Agent Dual Learning [bib] | Transformer | 6 | 1024 | 8M monolingual sentences + multi-agent training | 43.87 | No |
| Transformer + large batch sizes [bib] | Transformer | 6 | 1024 | Large batch sizes; PyTorch implementation | 43 | GitHub |
| DTMT [bib] | BiGRU | 5 + 10 | 1024 | One L-GRU followed by several T-GRUs | 42.02 | No |
| Transformer [bib] | Transformer | 6 | 1024 | TensorFlow implementation | 41.8 | GitHub |
| Deliberation network [bib] | BiLSTM | 4 | 1024 | Second-pass decoder deliberates over the first-pass output + 8M monolingual data | 41.5 | No |
| The Evolved Transformer [bib] | Evolved Transformer | 6 | 1024 | Bilingual data only | 41.3 | No |
| CNN [bib] | CNN | 15 | --- | --- | 40.51 | No |
| Dual transfer learning [bib] | BiLSTM | 4 | 1024 | Dual transfer learning + 8M monolingual data | 39.98 | No |
| GNMT [bib] | BiLSTM | 8 | 1024 | Fine-tuned with reinforcement learning | 39.92 | No |
| Dual supervised learning [bib] | BiGRU | 1 | 1000 | Probabilistic duality constraint P(x)P(y\|x) = P(y)P(x\|y) (regularizer below) | 34.84 | GitHub |
| Dual unsupervised learning [bib] | BiGRU | 1 | 1000 | Reconstruction duality | 34.83 | No |
| RNNSearch [bib] | BiGRU | 1 | 1000 | The first work to introduce attention into NMT (sketch below) | 28.45 | Yes |
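
The back-translation recipe behind the top entry is simple enough to sketch. Below is a minimal, hedged outline of the general technique, not the cited paper's actual pipeline (which adds generation and filtering details beyond this): target-side monolingual French is translated into synthetic English by a reverse model, and the resulting pairs are mixed with the real bilingual data. `reverse_model.translate` is a hypothetical stand-in for any trained French→English system, not an API from the paper's code.

```python
def back_translate(monolingual_fr, reverse_model):
    """Turn target-side monolingual French into synthetic (en, fr) pairs."""
    synthetic_pairs = []
    for fr_sentence in monolingual_fr:
        # Generate a synthetic English source with the reverse (fr->en) model.
        # `translate` is a hypothetical method, not from the released code.
        en_synthetic = reverse_model.translate(fr_sentence)
        synthetic_pairs.append((en_synthetic, fr_sentence))
    return synthetic_pairs

def build_training_set(bilingual_pairs, monolingual_fr, reverse_model):
    # Mix genuine bilingual pairs with synthetic ones; the final en->fr
    # model is then trained on the union as if all pairs were real.
    return bilingual_pairs + back_translate(monolingual_fr, reverse_model)
```

Back-translation scales with target-side data only, which is why the monolingual corpus size (31M here) is the headline detail of that row.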
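The duality constraint in the Dual Supervised Learning row can be made concrete. A plausible training-time form, reconstructed from the equality in the table rather than taken from the paper's code, approximates the marginals P(x) and P(y) with pretrained language models and penalizes the squared gap of the equality in log space:

```latex
% Ideal duality: P(x)\,P(y \mid x) = P(y)\,P(x \mid y).
% With marginals approximated by language models \hat{P}, the gap can be
% penalized as a regularizer added to both directions' MLE losses:
\ell_{\text{dual}}(x, y; \theta_{xy}, \theta_{yx}) =
  \Bigl( \log \hat{P}(x) + \log P(y \mid x; \theta_{xy})
       - \log \hat{P}(y) - \log P(x \mid y; \theta_{yx}) \Bigr)^{2}
```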
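Finally, the attention mechanism that RNNSearch introduced reduces to a few lines. The NumPy sketch below implements the standard additive (Bahdanau-style) scoring; variable names and the toy shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Additive (Bahdanau-style) attention over encoder annotations.

    decoder_state:  (d,)    previous decoder hidden state s_{i-1}
    encoder_states: (T, d)  bidirectional encoder annotations h_1..h_T
    W_s, W_h:       (a, d)  learned projections; v: (a,) scoring vector
    """
    # Alignment scores e_ij = v^T tanh(W_s s_{i-1} + W_h h_j)
    scores = np.tanh(encoder_states @ W_h.T + W_s @ decoder_state) @ v
    # Softmax over source positions -> attention weights alpha_ij
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_i: attention-weighted sum of the annotations
    return weights @ encoder_states, weights

# Toy usage with random parameters (shapes only; not trained weights).
rng = np.random.default_rng(0)
T, d, a = 5, 8, 16
context, alpha = additive_attention(
    rng.standard_normal(d), rng.standard_normal((T, d)),
    rng.standard_normal((a, d)), rng.standard_normal((a, d)),
    rng.standard_normal(a))
```

The context vector is recomputed at every decoder step, which is what lets the model align each target word with the relevant source positions instead of compressing the whole sentence into one fixed vector.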