Kensho Scribe is the world’s leading on-demand, fully automated transcription service for messy, real-world audio. Transcribe your audio files into human- and machine-readable text with state-of-the-art deep learning models.
Scribe is purpose-built to handle the complexities of your audio, leveraging S&P Global’s long history of providing high-quality transcripts to governments and corporations, including more than 100,000 hours of audio and associated text.
Scribe effectively transcribes accented speech and industry-specific terminology, picking up the nuances of spoken language, including mumbling, stuttering, and self-correction.
See the model in action with a Modzy MLOps platform demo or start a trial.
1.1% Character Error Rate (CER) and 3.2% Word Error Rate (WER)
To understand Scribe’s transcription abilities, we compared it to a leading transcription provider’s speech-to-text service. We evaluated both services on 50 randomly selected audio samples spread evenly across five categories (hereafter referred to as “buckets”): NPR podcasts, earnings calls, speeches, presentations, and meetings. We selected these five buckets to test our competitor’s service and Scribe on a broad range of audio samples that differ from the S&P data that Scribe was trained on and is used to working with. The buckets are either related to finance in some capacity or contain types of audio that prove problematic for most automatic speech recognition (ASR) services (e.g., the meetings bucket, which consists of board-room or town-hall meetings where participants spread throughout a room speak toward a single centralized microphone).
Scribe outperformed our competitor’s service in both Character Error Rate (CER) and Word Error Rate (WER). In terms of CER, both services were most successful with the speeches bucket and least successful with the meetings bucket. Our competitor obtained a mean CER of 2.1% on the speeches bucket and 16.9% on the meetings bucket, while Scribe obtained a mean CER of 1.1% on the speeches bucket and 10.5% on the meetings bucket. These results indicate that both services transcribed clear audio from a single speaker more accurately, whereas audio featuring multiple speakers, varying clarity, and varying amounts of background noise proved more challenging.
WER showed a similar pattern. Both services performed worst on the meetings bucket, with a WER of 21.7% for our competitor’s service and 17.0% for Scribe. Our competitor’s service again performed best on the speeches bucket, with a WER of 4.8%. Scribe outperformed our competitor on the speeches bucket with a WER of 3.2%, but its best result came on the presentations bucket, where it achieved a WER of 2.8%.
Character Error Rate (CER): The number of character errors an automatic speech recognition system makes, relative to the total number of characters in the reference transcript.
Word Error Rate (WER): Measures the performance of a speech recognition or machine translation system at the word level; it is derived from the Levenshtein distance.
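To make the two metrics concrete, here is a minimal sketch of how CER and WER can be computed from a reference transcript and a system transcript. It is an illustration of the standard definitions, not Scribe’s evaluation code: the function names are hypothetical, and it assumes whitespace tokenization with no case or punctuation normalization, which a real evaluation would likely handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

In this example one word out of four is wrong, but only one character out of nineteen differs, so WER is much higher than CER. The same effect explains why the WER figures reported above are consistently higher than the corresponding CER figures.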
Scribe processes every minute of audio in less than a second. Upload audio files for transcription or stream audio in real time to be transcribed live – all with unparalleled accuracy.
See how quickly you can deploy and run models, connect to pipelines, autoscale resources, and integrate into workflows with the Modzy MLOps platform.