Image Captioning

Model by Modzy

This model returns a textual caption describing the events occurring in an input image. Image captioning is the process of producing a natural-language description of an image. Automatically generating informative captions has the potential to assist people with visual impairments by describing images through text-to-speech systems, to provide a mechanism for image search, and even to suggest potential diagnoses from medical imagery. However, accurate image captioning is a challenging task that requires aligning, exploiting, and advancing technologies at the intersection of computer vision and natural language processing.

  • Description

    Product Description


    40.8% BLEU score: BLEU is a method for assessing the quality of text that has been machine-translated from one language to another. Applied to captioning, it compares each generated caption against human-written reference captions. The closer the machine output is to the expert human text, the better the score. Further information here.
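
    To make the metric concrete, the following is a minimal sketch of a simplified sentence-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names, the whitespace tokenization, and the lack of smoothing are assumptions made for illustration; this is not necessarily the exact BLEU variant behind the 40.8% figure above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: uniform n-gram weights, no smoothing."""
    cand = candidate.split()  # assumption: plain whitespace tokenization
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * math.exp(log_precision_sum)
```

    A caption identical to its reference scores 1.0, and a caption sharing no words with any reference scores 0.0. Reported scores such as the one above are usually aggregated over a whole evaluation set rather than averaged per sentence.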

    This model was trained on the Conceptual Captions dataset and validated on the 2014 COCO validation set.


    This model uses a ResNet-101 network, pretrained on the ImageNet dataset, to generate image features. These features are then passed into a Gated Recurrent Unit (GRU) recurrent neural network to generate the captions. To choose among the many candidate captions, beam search is used to find the caption with the highest likelihood. This model is based on the work introduced in this publication.
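
    The beam search step can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `beam_search`, `toy_step`, and the `TOY_MODEL` probability table stand in for the real decoder, whose interface is not described on this page. In the actual model the next-word probabilities would come from the GRU conditioned on the ResNet-101 image features.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Return (sequence, log-probability) of the best caption found.

    step_fn(sequence) must return a dict mapping each possible next token
    to its probability given the tokens generated so far.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width most likely partial captions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached the end token.
    return max(finished or beams, key=lambda item: item[1])


# Hypothetical next-word table standing in for the GRU decoder.
TOY_MODEL = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"dog": 0.5, "cat": 0.5},
    ("<s>", "the"): {"dog": 0.9, "cat": 0.1},
}


def toy_step(seq):
    """Look up next-word probabilities; unknown prefixes end the caption."""
    return TOY_MODEL.get(tuple(seq), {"</s>": 1.0})
```

    With `beam_width=1` the search is greedy and commits to "a" first, ending at "a dog" with probability 0.30. With `beam_width=2` it keeps "the" alive as well and finds "the dog" with probability 0.36, which is why a wider beam can return a more likely caption.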


    This model was trained on the Conceptual Captions dataset, which contains approximately 3 million images scraped from the internet along with their alt-text descriptions. Training on such a large dataset enables the generation of more realistic captions than models trained on similar but smaller, manually curated datasets such as COCO.


    This model was validated on the 2014 COCO validation set, which consists of 5,000 images and their corresponding human-generated captions.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.jpg    5M              .jpg
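
    A client could check a file against these specifications before submitting it. The sketch below is illustrative only: `check_input` is a hypothetical helper, and reading "5M" as 5 MiB (5 * 1024 * 1024 bytes) is an assumption, since the table does not define the unit precisely.

```python
import os

MAX_BYTES = 5 * 1024 * 1024   # assumption: "5M" in the table means 5 MiB
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files begin with these three bytes


def check_input(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if not path.lower().endswith(".jpg"):
        problems.append("file extension is not .jpg")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_BYTES}-byte limit")
    with open(path, "rb") as handle:
        if handle.read(3) != JPEG_MAGIC:
            problems.append("file does not start with a JPEG signature")
    return problems
```

    Checking the leading bytes as well as the extension catches files that were merely renamed to `.jpg`.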


    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1M              .json


    A JSON file containing the image filename along with the associated human-readable caption.
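
    Reading the output might look like the sketch below. The exact layout of results.json is not specified on this page, so the flat `{"<filename>": "<caption>"}` object assumed here, and the `read_captions` helper itself, are hypothetical and may need adjusting to the real schema.

```python
import json


def read_captions(json_text):
    """Parse results.json text into a list of (filename, caption) pairs.

    Assumes a flat JSON object mapping filename to caption; the real
    output layout may differ.
    """
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return [(name, str(caption)) for name, caption in data.items()]


# Example with made-up output text in the assumed layout.
pairs = read_captions('{"input.jpg": "a dog runs on the grass"}')
```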