This model returns a textual caption describing the events occurring in an input image. Image captioning is the task of producing a natural-language description of an image. Automatically generating informative captions has the potential to assist people with visual impairments by describing images through text-to-speech systems, to provide a mechanism for image search, and even to suggest potential diagnoses from medical imagery. However, accurate image captioning is a challenging task that requires aligning, exploiting, and advancing techniques at the intersection of computer vision and natural language processing.
40.8% BLEU score – BLEU is a method for assessing the quality of machine-generated text against human-written reference text; it was originally developed for machine translation. The closer the generated text is to an expert human reference, the higher the score.
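To make the metric concrete, here is a minimal sentence-level BLEU sketch with uniform n-gram weights and a brevity penalty. This is a simplified illustration, not the exact (smoothed, corpus-level) implementation used to produce the 40.8% figure:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n), scaled by a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:                   # unsmoothed: any zero precision -> 0
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "a man rides a horse".split()
candidate = "a man rides a horse on the beach".split()
score = bleu(candidate, reference)
```

A perfect match scores 1.0; a candidate sharing only some n-grams with the reference scores between 0 and 1.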
This model was trained on the Conceptual Captions dataset and validated on the 2014 COCO validation set.
This model uses a ResNet-101 network, pretrained on the ImageNet dataset, to extract image features. These features are then passed to a Gated Recurrent Unit (GRU) recurrent neural network that generates the caption. Beam search is used to select the highest-likelihood caption from the many candidate captions the decoder produces. This model is based on the work introduced in this publication.
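The beam-search decoding step can be sketched generically. In the sketch below, `step_fn` and the toy bigram table are illustrative assumptions standing in for the GRU decoder (which, in the actual model, is conditioned on the ResNet-101 image features):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam search over a token-level decoder.

    step_fn(seq) -> dict mapping each next token to its log-probability.
    Keeps the beam_width highest-scoring partial sequences at each step."""
    beams = [([start_token], 0.0)]            # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # completed caption: set aside
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(b for b in beams if b[0][-1] == end_token)
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]

# Toy stand-in for the decoder: a hand-written bigram distribution.
table = {
    "<s>":  {"a": math.log(0.9), "the": math.log(0.1)},
    "a":    {"dog": math.log(0.6), "cat": math.log(0.4)},
    "the":  {"dog": math.log(0.9)},
    "dog":  {"</s>": math.log(1.0)},
    "cat":  {"</s>": math.log(1.0)},
}
step = lambda seq: table.get(seq[-1], {"</s>": 0.0})
caption = beam_search(step, "<s>", "</s>")
```

With a beam width greater than 1, the search can recover a sequence whose overall probability is higher than the one greedy decoding would pick token by token.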
This model was trained on the Conceptual Captions dataset, which contains approximately 3 million images scraped from the internet along with their text alternatives. Training on such a large dataset enables the generation of more realistic captions than models trained on similar but smaller, manually curated datasets such as COCO.
This model was validated on the 2014 COCO validation set, which consists of 5,000 images and their corresponding human-generated captions.
The input(s) to this model must adhere to the following specifications:
This model will output the following:
A JSON file containing the image filename along with the associated human-readable caption.
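For illustration only — the exact field names below are assumptions, not the model's documented schema — the output might look like:

```json
{
  "image": "example.jpg",
  "caption": "a man riding a horse on the beach"
}
```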