Machine Translation English Arabic

Model by AppTek

This model automatically translates between English and Arabic using neural machine translation. It accepts text in the source language and translates it into the selected target language. The model can be used for general-domain translation and localization of content, including media and entertainment, news, e-commerce, and other non-technical content.

  • Description

    Product Description


    AppTek’s Arabic-to-English model reaches a case-sensitive BLEU score of 35.9% on the official NIST OpenMT 2012 newswire evaluation set (using a single reference translation). On a held-out broadcast news set from one of the top TV stations broadcasting in Modern Standard Arabic, the model reaches a BLEU score of 30.9%, outperforming one of the top online MT service providers by 35% relative. The English-to-Arabic model reaches a BLEU score of 27.7% on the official NIST OpenMT 2012 newswire evaluation set, which is more than 45% relative better than the score of one of the top online MT service providers on the same evaluation set.
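
    The "relative" figures above are ratios rather than differences in BLEU points. The following is a minimal sketch of that arithmetic. The function names are illustrative, and the implied scores of the other provider are derived from the numbers quoted above, not reported values.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative gain of score over baseline, as a fraction (0.35 means 35%)."""
    return (score - baseline) / baseline


def implied_baseline(score: float, relative_gain: float) -> float:
    """Baseline score implied by a reported score and its relative gain."""
    return score / (1.0 + relative_gain)


# Broadcast news set: 30.9 BLEU, reported as 35% relative better.
broadcast_baseline = implied_baseline(30.9, 0.35)

# NIST newswire, English-to-Arabic: 27.7 BLEU, reported as more than 45%
# relative better, so this value is only an upper bound on the other score.
newswire_baseline_bound = implied_baseline(27.7, 0.45)

print(round(broadcast_baseline, 1))       # 22.9
print(round(newswire_baseline_bound, 1))  # 19.1
```

    Under this reading, a 35% relative gain on the broadcast news set corresponds to a difference of roughly 8 BLEU points.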

    BLEU is a method for assessing the quality of text that has been machine-translated from one language to another. The closer the machine translation is to an expert human translation, the higher the score.
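
    To make the metric concrete, here is a simplified sketch of corpus-level BLEU with a single reference per sentence: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization and is not the official NIST scoring tool, so its numbers will not match the scores quoted above. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, in [0.0, 1.0]."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
score = corpus_bleu([hypothesis], [reference])
print(f"{100 * score:.1f}%")
```

    The tokens are not lowercased, which corresponds to a case-sensitive score; multiplying by 100 gives the percentage form used in this description.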

    Further information here.


    AppTek’s neural translation model is based on state-of-the-art recurrent neural network and Transformer architectures. Such architectures employ cross-lingual attention mechanisms and, in the case of the Transformer, multi-head, multi-layer self-attention mechanisms, which have been shown to significantly boost translation quality. The models are trained using the RETURNN toolkit, a software package for neural sequence-to-sequence models developed jointly by RWTH Aachen University, Germany, and AppTek. The toolkit is built on the TensorFlow backend and allows flexible and efficient specification, training, and deployment of different neural models.

    The models are trained on large amounts of parallel data, including AppTek’s proprietary data collected and created over the last 25 years. The training data totals 47 million parallel sentence pairs with 672 million running words. We employ a data augmentation technique to mine translations of rare words from less reliable or very domain-specific parallel corpora that are not part of the main training corpus. The model covers a large variety of domains, including newswire, web blogs, broadcast news and conversations, government and military, as well as entertainment and technology. The Arabic-to-English model includes normalization rules for the Arabic language so that it can handle various types of (partially noisy) input, including input with and without diacritics, spelling errors, etc.
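
    As an illustration of what Arabic input normalization can look like, the sketch below applies three commonly used steps: removing diacritics, removing the tatweel elongation character, and mapping alef variants to a bare alef. These steps are assumptions chosen for illustration. AppTek's actual normalization rules are not described here, and the function name is hypothetical.

```python
import re

# Arabic short-vowel and related marks (fathatan U+064B through sukun U+0652)
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for visual justification
# Alef with hamza above, hamza below, and madda, all mapped to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(text: str) -> str:
    """Illustrative normalization so diacritized and plain input look alike."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_VARIANTS)
    return " ".join(text.split())  # collapse runs of whitespace


# The word "kataba" written with and without short-vowel marks.
with_diacritics = "\u0643\u064E\u062A\u064E\u0628\u064E"
without_diacritics = "\u0643\u062A\u0628"
print(normalize_arabic(with_diacritics) == without_diacritics)  # True
```

    Mapping both spellings to the same string means the translation model sees one form of each word regardless of how the input was typed.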


    Our models are trained using segmentation of source and target words into sub-word units, with the sub-word vocabulary size ranging from 20,000 to 40,000 units. In the RNN-based attention model, both the source and the target sub-words are projected into a 620-dimensional embedding space. The models are equipped with either 4 or 6 layers of bidirectional encoder using LSTM cells with 1,000 units. A unidirectional decoder with the same number of units is used in all cases. The model is trained with a layer-wise pre-training scheme that leads to both better convergence and faster training during the initial pre-training epochs.

    In the Transformer model, both the self-attentive encoder and the decoder consist of 6 stacked layers. Every layer is composed of two sub-layers: an 8-head self-attention layer followed by a feed-forward layer with rectified linear unit (ReLU) activation. We applied layer normalization before each sub-layer, whereas dropout and residual connections were applied afterwards. All projection layers and the multi-head attention layers have 512 nodes and are followed by a feed-forward layer with 2,048 nodes.

    The models are trained using the Adam optimization algorithm with a learning rate of 0.001 for the RNN-based attention model and 0.0003 for the Transformer model. We apply learning rate scheduling based on the validation-set perplexity over a few consecutive evaluation checkpoints. We also employ label smoothing of 0.1 in all training runs. The dropout rate ranged from 0.1 to 0.3.
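
    For readers unfamiliar with label smoothing, the sketch below shows how a smoothing value of 0.1 changes the training target for a single output position. It assumes the variant that spreads the smoothed mass uniformly over the other vocabulary units (another common variant spreads it over all units), and it uses a toy vocabulary of 5 units rather than the 20,000 to 40,000 sub-word units mentioned above. The function names are illustrative and not part of RETURNN.

```python
import math


def smoothed_targets(true_index, vocab_size, smoothing=0.1):
    """Target distribution with 1 - smoothing on the true unit.

    The remaining mass is split evenly over the other units (assumed variant).
    """
    off_value = smoothing / (vocab_size - 1)
    targets = [off_value] * vocab_size
    targets[true_index] = 1.0 - smoothing
    return targets


def cross_entropy(targets, predicted_probs):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, predicted_probs) if t > 0.0)


# Toy example: the correct sub-word unit is index 2.
predicted = [0.05, 0.05, 0.80, 0.05, 0.05]
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]
smoothed = smoothed_targets(true_index=2, vocab_size=5, smoothing=0.1)

print(cross_entropy(one_hot, predicted))
print(cross_entropy(smoothed, predicted))
```

    With smoothed targets, the loss also rewards keeping some probability on the other units, which discourages overconfident predictions.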

