Scaling Neural Machine Translation

July 11, 2018, 4:58 p.m. By: Kirti Bakshi

Even today, sequence-to-sequence learning models require several days to reach state-of-the-art performance on large benchmark datasets when trained on a single machine. This paper shows that reduced-precision and large-batch training, with careful tuning and implementation, can speed up training on a single 8-GPU machine by nearly 5x.

Introduction:

Neural Machine Translation has seen impressive progress in recent years with the introduction of ever more efficient architectures, and similar sequence-to-sequence models are applied to many other Natural Language Processing tasks. However, training state-of-the-art models on large datasets is computationally intensive and can require several days even on a machine with 8 high-end graphics processing units (GPUs).

Scaling training to multiple machines enables faster experimental turnaround but also introduces new challenges:

  • How can efficiency be maintained in a distributed setup when some batches process faster than others?

  • How is generalization performance affected by larger batch sizes?

While the former is specific to multi-machine training, the latter previews challenges that even users of commodity hardware are likely to face soon, assuming hardware continues to improve at its current rapid rate. In this paper, the authors first focus on approaches to improve training efficiency on a single machine. By training with reduced floating-point precision, they decrease training time by 65% with no effect on accuracy.
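
The reduced-precision setup runs computation in FP16 while keeping the parameter updates numerically stable. Below is a minimal sketch of a common mixed-precision recipe consistent with that idea: forward and backward passes in half precision, a scaled loss so small gradients remain representable in FP16, and an FP32 master copy of the weights for the optimizer step. The helper names and the loss_scale value are illustrative, not taken from the authors' code.

import torch

def build_fp16_model(model, lr=5e-4):
    """Convert a model to FP16 and keep an FP32 master copy of its weights."""
    model = model.half().cuda()
    master_params = [p.detach().clone().float() for p in model.parameters()]
    for p in master_params:
        p.requires_grad = True
    optimizer = torch.optim.Adam(master_params, lr=lr)
    return model, master_params, optimizer

def fp16_train_step(model, criterion, batch, master_params, optimizer,
                    loss_scale=128.0):
    """One update: forward/backward in FP16, optimizer step on FP32 masters."""
    output = model(batch["input"])
    loss = criterion(output, batch["target"])
    (loss * loss_scale).backward()  # scale the loss so small gradients survive FP16

    # Move the FP16 gradients onto the FP32 master weights and undo the scaling.
    for p16, p32 in zip(model.parameters(), master_params):
        if p16.grad is not None:
            p32.grad = p16.grad.detach().float() / loss_scale

    optimizer.step()       # update the FP32 master copy
    optimizer.zero_grad()
    model.zero_grad()

    # Copy the updated FP32 weights back into the FP16 model.
    with torch.no_grad():
        for p16, p32 in zip(model.parameters(), master_params):
            p16.copy_(p32)
    return loss.item()
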
Next, they assess the effect of dramatically increasing the batch size, from 25k to over 400k tokens, a change that is necessary for large-scale parallelization with synchronous Stochastic Gradient Descent (SGD). On a single machine, this is implemented by accumulating gradients from several batches before each update, as sketched below. They find that training with large batches and an increased learning rate further reduces training time by 40% on a single machine and by 90% in a distributed 16-machine setup.
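
A minimal sketch of this gradient-accumulation idea follows; the function and parameter names (train_with_accumulation, accumulation_steps) are illustrative rather than the paper's, but the mechanism is the same: sum gradients over several small batches and apply a single optimizer step, simulating a much larger effective batch.

import torch

def train_with_accumulation(model, criterion, optimizer, data_loader,
                            accumulation_steps=16):
    """Accumulate gradients over several batches before each parameter update."""
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(data_loader):
        output = model(batch["input"])
        loss = criterion(output, batch["target"])
        # Divide by the number of accumulated batches so the gradient
        # magnitude matches that of one large batch.
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()        # one update per `accumulation_steps` batches
            optimizer.zero_grad()
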
These improvements enable them to train a WMT'14 En-De model to the same accuracy as Vaswani et al. (2017) in just 37 minutes on 128 GPUs. When training to full convergence, they reach a new state of the art of 29.3 BLEU in 91 minutes. These scalability efforts also make it possible to train models on much larger datasets.

They further show that a model trained on a combined corpus of WMT and Paracrawl data containing ∼150M sentence pairs can reach 29.8 BLEU on the same test set in less than 10 hours. Similarly, on the WMT'14 En-Fr task they obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.

Experiments and Results:

  • Half-Precision Training

  • Training with Larger Batches

  • Results with WMT Training Data

  • Results with WMT & Paracrawl Training

Conclusions:

In this paper, the authors have explored how to train state-of-the-art NMT models on large-scale parallel hardware. They investigated lower-precision computation, very large batch sizes (up to 400k tokens), and larger learning rates. Their careful implementation speeds up the training of a big transformer model (Vaswani et al., 2017) by nearly 5x on one machine with 8 GPUs. These improvements enable training a WMT'14 En-De model to the same accuracy as Vaswani et al. (2017) in just 37 minutes on 128 GPUs, and reaching a new state of the art of 29.3 BLEU in 91 minutes when training to full convergence. The scalability gains also make it possible to train on much larger datasets: a model trained on a combined corpus of WMT and Paracrawl data containing ∼150M sentence pairs reaches 29.8 BLEU on the same test set in less than 10 hours, and on the WMT'14 En-Fr task they obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.
Overall, their work shows that future hardware will enable training times for large NMT systems comparable to those of phrase-based systems. Future work may consider improved batching and communication strategies.

Link To The PDF: Click Here