Multi-Language OCR

Model by Open Source

This model converts scanned images of text, and other images containing embedded text, into electronic text. It accepts scanned images in multiple formats, including JPG, PNG, and many others; the input can include tables. It produces output in PDF, TSV, plain text, and other formats. The model also supports user-supplied patterns and words, and 107 writing systems (scripts) and languages. It does not process color images or recognize handwriting. This model can be used in multiple ways, such as recovering electronic text from printouts, archival paper documents, and web pages containing only images of text.

    Product Description

    PERFORMANCE METRICS:

    96.6% Word Accuracy (300 dpi 8-bit gray scale) and 96.3% Word Accuracy (300 dpi bitonal)

    The training of the adaptive classifier uses a small amount of data: 20 samples of 94 characters from 8 fonts in a single size, with four attributes (normal, bold, italic, bold italic), giving a total of 60,160 training samples. This model gave 96.34% word accuracy on a standard collection of documents provided by DOE when scanned at 300 dpi bitonal, and 96.62% when scanned at 300 dpi 8-bit gray scale, as measured at the Fourth Annual Test of OCR Accuracy. For faxed versions of the business letters (fine-mode fax), the model gave 95.30% word accuracy. This model handles multiple image formats, handles tables, and performs page segmentation. It supports user-supplied patterns and words, and 107 writing systems (scripts) and languages. It does not process color images or recognize handwriting.

    Measures how close the output text is to the reference text at the word level using the Levenshtein distance. If the reference text is a substring in the output text, it is considered correct.

    Further information here.
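    As an illustration, the metric described above can be computed as follows. This is an illustrative sketch, not the evaluator used in the benchmark, and it applies the substring rule as a plain containment check (an assumption about how that rule is defined).

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def word_accuracy(reference, output):
    """Word-level accuracy: 1 - (word edit distance / reference word count)."""
    if reference in output:  # substring rule: contained reference counts as correct
        return 1.0
    ref_words, out_words = reference.split(), output.split()
    return max(0.0, 1.0 - levenshtein(ref_words, out_words) / len(ref_words))
```

    Here the distance is taken over word tokens rather than characters, so a single misread word costs one edit regardless of how many characters inside it are wrong.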


    OVERVIEW:

    This model uses a number of processing steps. It first applies a connected-components technique in which the outlines of the components are stored, then gathers the outlines into approximate shapes (‘blobs’). Word recognition is a two-pass process: a first pass attempts to recognize each word, and the words recognized satisfactorily are passed to an adaptive classifier as training data for the second pass.
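    The two-pass scheme can be sketched as follows. All names here are hypothetical, and the confidence threshold, the criterion for a "satisfactory" word, and the assumption that the second pass retries only low-confidence words with the adaptive classifier are illustrative choices, not documented behavior.

```python
def recognize_page(word_images, static_classifier, adaptive_classifier,
                   threshold=0.9):
    # Pass 1: recognize every word with the static classifier; words that
    # are recognized satisfactorily train the adaptive classifier.
    first_pass = []
    for image in word_images:
        text, confidence = static_classifier(image)
        first_pass.append((image, text, confidence))
        if confidence >= threshold:
            adaptive_classifier.train(image, text)
    # Pass 2: retry the unsatisfactory words with the adaptive classifier,
    # which has now adapted to the fonts on this particular page.
    results = []
    for image, text, confidence in first_pass:
        if confidence < threshold:
            text, _ = adaptive_classifier.classify(image)
        results.append(text)
    return results
```

    The design point is that the adaptive classifier is trained on this very page, so it can pick up the document's specific fonts before the difficult words are retried.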

    TRAINING:

    The training dataset is small: 20 samples of 94 characters from 8 fonts in a single size, with four attributes (normal, bold, italic, bold italic), giving a total of 60,160 training samples.
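    The stated total is consistent with the breakdown above:

```python
# Check the stated sample count: 20 samples x 94 characters x 8 fonts x 4 attributes.
samples_per_char = 20
characters = 94
fonts = 8
attributes = 4  # normal, bold, italic, bold italic
total = samples_per_char * characters * fonts * attributes
print(total)  # 60160
```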

    VALIDATION:

    This model was tested against several datasets, representing different types of documents: original business letters (319K characters), a sample from DOE (1.4M characters), a sample of magazines (666K characters), English newspapers (492K characters), and Spanish newspapers (348K characters). In each of these tests, generally done with two different scanning resolutions, the metrics included the number of errors and the accuracy, at both the word and character levels.

    INPUT SPECIFICATION

    The input(s) to this model must adhere to the following specifications:

    Filename      Maximum Size   Accepted Format(s)
    input.txt     1M             .jpg, .png
    config.json   1M             .json
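    A hypothetical pre-flight check against these limits might look like the following. The helper name is invented, and "1M" is interpreted as 1,000,000 bytes (an assumption about the unit).

```python
import os

MAX_BYTES = 1_000_000                      # "1M" read as 1,000,000 bytes (assumption)
ACCEPTED_IMAGE_FORMATS = {".jpg", ".png"}  # per the table above

def check_input_image(path):
    """Reject a candidate input image that violates the size or format limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_IMAGE_FORMATS:
        raise ValueError(f"unsupported format {ext!r}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"{path} exceeds {MAX_BYTES} bytes")
    return path
```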

    OUTPUT DETAILS

    This model will output the following:

    Filename      Maximum Size   Format
    results.json  1M             .json