Video Captioning

Model by Open Source

This model generates a one-sentence description of a short video clip. It accepts a short video in MP4 or MPEG format as input and outputs a text description of the video's contents. The model can be used to identify the content of unseen videos by describing them, which makes it possible to quickly search a large collection of videos using established text search and retrieval tools.
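Once captions exist, retrieval reduces to ordinary text search. The sketch below is hypothetical: the captions dictionary and the simple all-words match are illustrative stand-ins, not part of the model.

```python
# Hypothetical sketch: once each clip has a generated caption, a plain
# keyword search over the captions can retrieve matching videos.
# The caption data below is illustrative, not actual model output.
captions = {
    "clip_001.mp4": "a man is playing a guitar",
    "clip_002.mp4": "a dog is running on the beach",
    "clip_003.mp4": "a woman is slicing an onion",
}

def search_videos(query, captions):
    """Return filenames whose caption contains every query word."""
    words = query.lower().split()
    return [name for name, text in captions.items()
            if all(w in text.lower().split() for w in words)]

print(search_videos("dog beach", captions))  # ['clip_002.mp4']
```

In practice the captions would be fed into an established text index (e.g. an inverted index) rather than scanned linearly, but the principle is the same.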

  • Description

    Product Description


    43.7% BLEU score and 31% METEOR score

    The model was trained on two datasets, the Microsoft Research Video Description Corpus (MSVD) and the Montreal Video Annotation Dataset (M-VAD). The metrics used are the BLEU score and the METEOR score. Both are commonly used in natural language processing tasks: BLEU is a modified n-gram precision of the generated sentence against a gold standard, while METEOR is based on explicit word-to-word matching between generated sentences and the gold standard. This model achieves a BLEU score of 0.437 and a METEOR score of 0.31 on the MSVD dataset.

    BLEU is a method for assessing the quality of text that has been machine-translated from one language to another. The closer the machine translation is to an expert human translation, the better the score.

    Further information here.
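    The core of BLEU is clipped (modified) n-gram precision. The sketch below is a simplified single-reference illustration of that idea, not the full corpus-level BLEU used to report the scores above.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped precisions times a brevity penalty.
    A simplified single-reference sketch of the BLEU idea."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(cand, ref, max_n=2), 3))  # 0.707
```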

    METEOR is a metric that is similar to BLEU, but aims to improve the correlation between the machine-translated text and human judgement at the sentence level, as well as at the corpus level.

    Further information here.
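    The sketch below illustrates only the recall-weighted harmonic mean at the heart of METEOR, using exact word matches. Real METEOR additionally matches stems and synonyms and applies a fragmentation penalty, which this simplification omits.

```python
def meteor_fmean(candidate, reference):
    """Simplified METEOR-style score using exact unique-word matches
    only. Real METEOR also matches stems and synonyms and applies a
    fragmentation penalty; this sketch keeps just the recall-weighted
    harmonic mean F = 10PR / (R + 9P)."""
    matches = len(set(candidate) & set(reference))
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 10 * precision * recall / (recall + 9 * precision)

cand = "a man is playing a guitar".split()
ref = "a man plays the guitar".split()
print(round(meteor_fmean(cand, ref), 3))  # 0.588
```

The 9:1 weighting toward recall is what distinguishes METEOR's F-mean from a plain F1 score: a caption that omits reference content is penalised more heavily than one that adds extra words.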


    This model relies on a sequence-to-sequence approach with a temporal attention mechanism. It uses a two-layer LSTM. At each time step, the first layer takes as input the concatenation of the word embedding from the previous time step and its hidden state. The second layer incorporates an attention mechanism that looks across the different frames of the video. This enables the model to take the temporal nature of the video into account as it generates its caption.
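    The temporal attention step can be sketched as follows. This is a minimal illustration with toy dimensions: the dot-product scoring stands in for the learned attention network, and no real LSTM weights are involved.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(frame_features, decoder_state):
    """At each decoding step, score every frame feature against the
    current decoder state, normalise with a softmax, and return the
    attention weights plus the weighted sum (context vector) that is
    fed into the caption generator. Dot-product scoring here is an
    illustrative simplification of the learned attention network."""
    scores = [sum(f * h for f, h in zip(feat, decoder_state))
              for feat in frame_features]
    weights = softmax(scores)
    context = [sum(w * feat[d] for w, feat in zip(weights, frame_features))
               for d in range(len(frame_features[0]))]
    return weights, context

# Three frames with 2-dim toy features; the decoder state attends
# most to the frames aligned with it.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = temporal_attention(frames, [2.0, 0.0])
print([round(w, 3) for w in weights])
```

Because the weights are recomputed at every decoding step, the model can focus on different frames while generating different words of the caption.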

    The videos are downsampled by selecting every 8th frame of the original video, and features are extracted from each selected frame using a publicly available pre-trained image classification model. These frame features are passed into the model described above.
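    The preprocessing pipeline can be sketched as below. `feature_fn` is a hypothetical placeholder for the pre-trained classifier's feature extractor, which the source does not name.

```python
def sample_frame_indices(n_frames, stride=8):
    """Keep every `stride`-th frame, matching the every-8th-frame
    downsampling described above."""
    return list(range(0, n_frames, stride))

def extract_features(frames, feature_fn):
    """`feature_fn` stands in for a pre-trained image classifier's
    feature extractor (e.g. a CNN's penultimate layer); it is a
    hypothetical placeholder supplied by the caller."""
    return [feature_fn(frame) for frame in frames]

indices = sample_frame_indices(n_frames=240)
print(len(indices))  # 30 sampled frames from a 240-frame clip
```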


    The model was trained on two datasets, the Microsoft Research Video Description Corpus and the Montreal Video Annotation Dataset.

    The Microsoft Research Video Description Corpus (MSVD) is a set of video clips aggregated from YouTube, containing 1,970 short clips with ≈40 captions per clip. The videos were collected and annotated by crowdsourcing on Amazon Mechanical Turk. The clips mostly contain a single activity and can be described using only one sentence. For fair comparison with prior work, the dataset was split into train/validation/test sets of 1,200, 100, and 670 clips, respectively. The sentences and vocabularies were also pre-processed by tokenizing, converting to lower case, and removing punctuation.
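    The caption pre-processing described above (tokenizing, lowercasing, removing punctuation) can be sketched with a few lines of Python; the whitespace tokenizer here is a simplification of whatever tokenizer was actually used.

```python
import string

def preprocess(sentence):
    """Tokenise on whitespace, lowercase, and strip punctuation, as
    described for the MSVD captions. The whitespace split is a
    simplifying assumption about the tokenizer."""
    table = str.maketrans("", "", string.punctuation)
    return [tok.translate(table).lower()
            for tok in sentence.split()
            if tok.translate(table)]

print(preprocess("A man, smiling, plays the guitar!"))
# ['a', 'man', 'smiling', 'plays', 'the', 'guitar']
```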

    The Montreal Video Annotation Dataset (M-VAD) is a large collection of movie clips collected from 92 movies and split into 46,589 short clips. Each clip is associated with a description, which can be more than one sentence. The dataset provides an official training/validation/test split, consisting of 36,921, 4,717, and 4,951 video clips, respectively. All words in the training data were used as the vocabulary, and the only pre-processing applied was tokenizing the sentences.


    This model was tested on the test sets of the Microsoft Research Video Description Corpus and the Montreal Video Annotation Dataset, achieving a BLEU score of 0.437 and a METEOR score of 0.31 on the MSVD test set.


    The input(s) to this model must adhere to the following specifications:

    Filename   Maximum Size   Accepted Format(s)
    --------   ------------   ------------------
    input      1 GB           .mp4


    This model will output the following:

    Filename      Maximum Size   Format
    --------      ------------   ------
    results.txt   100 MB         .txt