This model converts English-language speech from 16 kHz audio, telephony, and other files into text. It accepts WAV, MP3, PCM, and other popular audio formats and outputs text in JSON, XML, TEXT, or SRT format. The output includes punctuation, capitalization, timecodes, word confidence scores, and speaker diarization. The model can be used to transcribe telephone calls, meetings, interviews, and other 8 kHz recorded content.
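To make the relationship between the word-level output (timecodes, confidence, speaker labels) and the SRT format concrete, here is a minimal Python sketch. The `Word` record, its field names, and the `words_to_srt` helper are hypothetical and chosen for illustration only; they are not the model's actual JSON schema or API. The sketch groups consecutive words by speaker into short cues and prints them with SRT-style timestamps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical word-level record; field names are assumptions,
    # not the model's real output schema.
    text: str
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # word confidence score in [0, 1]
    speaker: str       # speaker label from diarization


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into cues, breaking on speaker change or cue length."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        speaker_changed = bool(current) and w.speaker != current[0].speaker
        if current and (len(current) >= max_words or speaker_changed):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        span = f"{srt_timestamp(cue[0].start)} --> {srt_timestamp(cue[-1].end)}"
        blocks.append(f"{index}\n{span}\n{cue[0].speaker}: {text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, 0.98, "S1"),
        Word("there.", 0.4, 0.9, 0.95, "S1"),
        Word("Hi.", 1.2, 1.5, 0.90, "S2"),
    ]
    print(words_to_srt(sample))
```

Prefixing each cue with the speaker label is one possible convention; a real integration would follow whatever layout the model's SRT output actually uses.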
AppTek’s ASR models achieve approximately the same accuracy on real-world data as the top cloud service providers. When we build ASR systems for academic tasks under training and evaluation conditions comparable to those of other ASR teams in the community, we achieve state-of-the-art results on popular US English benchmark tasks such as LibriSpeech (5.5% word error rate on “test-other”) and Switchboard (11.7% word error rate on “Hub5 2000 eval”).
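Results on LibriSpeech and Switchboard are conventionally reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below shows that calculation in plain Python. It is an illustration of the metric only, assuming whitespace tokenization and no text normalization; it is not AppTek's scoring pipeline, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Published benchmark scores usually apply task-specific text normalization before scoring, so numbers from a simple function like this one are not directly comparable to them.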
AppTek’s acoustic models are backed by bidirectional recurrent neural networks with LSTM units. The models are trained using the RETURNN toolkit, a software package for neural sequence-to-sequence models developed jointly by RWTH Aachen University, Germany, and AppTek. The toolkit is built on the TensorFlow backend and allows flexible and efficient specification, training, and deployment of different neural models.
AppTek trains all ASR models on very large collections of annotated audio data. We compile the training data from a wide variety of sources in order to achieve a high level of generalization.