Machine Translation English Spanish

Model by AppTek

This model automatically translates between English and Spanish using neural machine translation technology. It accepts text in a source language and translates it into the selected target language. The model is suited to general-domain translation and content localization, including media and entertainment, news, e-commerce, and other non-technical content.

  • Description

    Product Description


    AppTek’s English-to-Spanish model obtains a BLEU score of 34.5% on the 2013 news test set of the official Workshop on Statistical Machine Translation (WMT) evaluation (European Spanish); the best-scoring system in that evaluation obtained 30.4%. On a held-out movie subtitle test set with full-episode subtitles, the same model obtains a BLEU score of 38.7%, outperforming one of the leading online machine translation (MT) service providers by 6% relative. The model for the reverse direction, Spanish-to-English, is trained on the same data and can translate all Spanish language varieties. It reaches a BLEU score of 45.5% on a held-out movie subtitle test set with full-episode subtitles, outperforming one of the top online MT service providers, which obtains a BLEU score of 42.3%. At the same time, the model is also strong on news content, obtaining a case-sensitive BLEU score of 35.2% on the 2013 news test set of the official WMT evaluation, whose winning system obtained a BLEU score of 31.4%.

    BLEU is a method for assessing the quality of text that has been machine-translated from one language to another. The closer the machine translation is to an expert human translation, the higher the score.

    Further information here.
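As a rough illustration of how BLEU works, the sketch below computes a simplified sentence-level BLEU score in plain Python: modified n-gram precisions up to order 4, combined with a brevity penalty. Real evaluations use corpus-level BLEU with standardized tokenization (e.g., as in the WMT scoring scripts); this minimal version omits smoothing and is for intuition only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 3))  # a perfect match scores 1.0
```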


    AppTek’s neural translation model is based on state-of-the-art recurrent neural network and Transformer architectures. Such architectures employ cross-lingual attention mechanisms and, in the case of the Transformer, multi-head, multi-layer self-attention mechanisms, which have been shown to significantly boost translation quality. The models are trained using the RETURNN toolkit — a software package for neural sequence-to-sequence models developed jointly by RWTH Aachen University, Germany, and AppTek. The toolkit is built on the TensorFlow backend and allows flexible and efficient specification, training, and deployment of different neural models. The model is trained on a large parallel, sentence-aligned corpus of 37 million sentence pairs and 557 million running words, covering a wide range of domains, including news, entertainment, government, talks, and technical content. We employ a data augmentation technique to mine translations of rare words from less reliable or very domain-specific parallel corpora that are not part of the main training corpus. Within a single model, we support both the European and the Latin American Spanish varieties.
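The attention mechanisms mentioned above all reduce, at their core, to scaled dot-product attention: each target-side position (query) computes a softmax-weighted average over source-side representations (keys and values). The NumPy sketch below is an illustrative toy, not AppTek's implementation; the dimensions are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))  # 3 target positions (queries)
K = rng.standard_normal((5, 64))  # 5 source positions (keys)
V = rng.standard_normal((5, 64))  # source representations (values)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 64): one context vector per target position
```

Multi-head attention runs several such attentions in parallel on learned projections of Q, K, and V, and concatenates the results.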


    Our models are trained using segmentation of source and target words into sub-word units, with sub-word vocabulary sizes ranging from 20 to 40 thousand units. In the RNN-based attention model, both the source and the target sub-words are projected into a 620-dimensional embedding space. These models are equipped with either 4 or 6 bidirectional encoder layers using LSTM cells with 1,000 units; a unidirectional decoder with the same number of units is used in all cases. The model is trained with a layer-wise pre-training scheme that leads to both better convergence and faster training speed during the initial pre-training epochs.

    In the Transformer model, both the self-attentive encoder and the decoder consist of 6 stacked layers. Every layer is composed of two sub-layers: an 8-head self-attention layer followed by a position-wise feed-forward layer with rectified linear unit (ReLU) activation. We apply layer normalization before each sub-layer, with dropout and residual connections applied afterwards. All projection layers and the multi-head attention layers consist of 512 nodes, followed by a feed-forward layer equipped with 2,048 nodes.

    The models are trained using the Adam optimization algorithm, with a learning rate of 0.001 for the attention RNN-based model and 0.0003 for the Transformer model. We apply learning rate scheduling based on the perplexity on the validation set over a few consecutive evaluation checkpoints. We also employ label smoothing of 0.1 in all training runs. The dropout rate ranges from 0.1 to 0.3.
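The perplexity-based learning rate scheduling described above can be sketched as follows. This is a minimal stand-in for the scheduling RETURNN performs; the reduction factor of 0.5 and the patience of two checkpoints are illustrative assumptions, not documented values.

```python
class PerplexityScheduler:
    """Reduce the learning rate when validation perplexity fails to improve
    for `patience` consecutive evaluation checkpoints.
    factor and patience are illustrative defaults, not AppTek's settings."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_checkpoints = 0

    def step(self, val_ppl):
        """Report a validation perplexity; return the (possibly reduced) lr."""
        if val_ppl < self.best:
            self.best = val_ppl
            self.bad_checkpoints = 0
        else:
            self.bad_checkpoints += 1
            if self.bad_checkpoints >= self.patience:
                self.lr *= self.factor
                self.bad_checkpoints = 0
        return self.lr

sched = PerplexityScheduler(lr=0.0003)  # Transformer learning rate from the text
for ppl in [12.0, 10.5, 10.6, 10.7]:    # two non-improving checkpoints in a row
    lr = sched.step(ppl)
print(lr)  # learning rate halved to 0.00015
```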


    Not available.