This model returns a textual caption describing the events occurring in an input image. Image captioning is the process of producing a natural-language description of an image. Automatically generated captions could assist people with visual impairments by describing images through text-to-speech systems, provide a mechanism for image search, and even suggest potential diagnoses from medical imagery. However, accurate image captioning is a challenging task that requires aligning and advancing techniques at the intersection of computer vision and natural language processing.
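40.8% BLEU score – BLEU is a metric originally developed to assess the quality of text that has been machine-translated from one language to another, based on its n-gram overlap with expert human translations. Here it compares generated captions with human-written reference captions. The closer the generated text is to the human references, the higher the score.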
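To make the metric concrete, the following is a minimal sketch of a sentence-level BLEU computation: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, uniform weights, and no smoothing. The page does not say which BLEU variant produced the reported score, so this sketch will not necessarily reproduce it. The function names `ngrams` and `bleu` are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (a sketch)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens

        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)

        # Clip candidate counts so repeated words are not over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ["a dog runs on the beach"]
print(bleu("a dog runs on the beach", reference))  # identical caption
print(bleu("a dog runs on the sand", reference))   # one word differs
```

An identical caption scores 1.0, while changing a single word lowers every n-gram precision that includes that word, which is why BLEU rewards captions that match the references in both vocabulary and word order.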
This model was trained on the Conceptual Captions dataset and validated on the 2014 COCO validation set.
This model uses a ResNet-101 network, pretrained on the ImageNet dataset, to generate image features. These features are then passed into a Gated Recurrent Unit (GRU) recurrent neural network to generate candidate captions. To choose among the many candidates, beam search is used to find the caption with the highest likelihood. This model is based on the work introduced in this publication.
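The beam search step can be illustrated with a small, self-contained sketch. It is not the model's actual decoder: the GRU is replaced by a hypothetical lookup table (`TOY_MODEL`) that maps the last token to next-token probabilities, and the names `beam_search` and `toy_step` are assumptions made for this example. The idea is to keep only the `beam_width` most likely partial captions at each step instead of committing to the single best next word.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Return (tokens, log_prob) for the most likely sequence found by the beam."""
    beams = [([start_token], 0.0)]  # unfinished hypotheses
    finished = []                   # hypotheses that reached end_token

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break

        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    # Fall back to unfinished hypotheses only if nothing reached end_token.
    return max(finished or beams, key=lambda c: c[1])


# Hypothetical stand-in for the GRU: next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def toy_step(tokens):
    return TOY_MODEL[tokens[-1]]


greedy_tokens, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_width=1)
beam_tokens, beam_score = beam_search(toy_step, "<s>", "</s>", beam_width=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a beam width of 1 the search is greedy and commits to "a" because it is the most likely first word, ending with a sequence probability of 0.3. With a beam width of 2 the search also keeps "the" alive and finds "the dog", whose overall probability of 0.36 is higher. Beam search remains a heuristic: a wider beam explores more candidates at a higher cost, but it is not guaranteed to find the globally most likely caption.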
This model was trained on the Conceptual Captions dataset, which contains approximately 3 million images scraped from the internet along with their text alternatives (alt text). Training on such a large dataset enables the generation of more realistic captions than models trained on similar but smaller, manually curated datasets such as COCO.
This model was validated on the 2014 COCO validation set, which consists of 5,000 images and their corresponding human-generated captions.
The input(s) to this model must adhere to the following specifications:
This model will output the following:
A JSON file containing the image filename along with the associated human-readable caption.
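As a rough illustration of consuming this output, the sketch below parses a JSON payload and extracts the filename and caption. The exact schema is not given here, so the key names `filename` and `caption`, the sample values, and the helper `read_caption` are all assumptions made for this example. Check the model's actual output before relying on them.

```python
import json

# Hypothetical payload: the key names and values are assumed, not documented.
raw_output = '{"filename": "beach.jpg", "caption": "a dog runs on the beach"}'


def read_caption(payload):
    """Parse a JSON string and return (filename, caption) under the assumed schema."""
    record = json.loads(payload)
    return record["filename"], record["caption"]


filename, caption = read_caption(raw_output)
print(f"{filename}: {caption}")
```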