Object Tracking

Model by Modzy

This model tracks objects visually in real time. It uses a Siamese region proposal network (Siamese-RPN) trained end-to-end on large datasets so that it can track objects across domains. The model accepts video frames or a folder of JPEG images of any size and performs tracking on any type of moving object, such as people or vehicles, across consecutive images or video frames. It runs at 160 frames per second and achieves leading performance on well-known tracking benchmarks such as VOT2016 and VOT2017.

  • Description

    Product Description


    57% Average Accuracy – The average overlap between the predicted and ground-truth bounding boxes during periods of successful tracking, averaged across sequences.

    35% Expected Average Overlap (EAO) – An estimator of the average overlap a tracker is expected to attain on a large collection of short-term sequences with the same visual properties as the given dataset.

    97% Robustness – A measure of the tracker's reliability, reflecting how rarely it loses the target and must be re-initialized during a sequence.

    The model was tested on the datasets of the Visual Object Tracking (VOT) challenges VOT2015, VOT2016, and VOT2017, each containing 60 videos, and on OTB2015, which contains 100 videos. On these datasets the model reached an average accuracy of 0.57, a robustness of 0.97, and an EAO of 0.35, where EAO (expected average overlap) is a measure that accounts for both accuracy and robustness across frames. Siamese-RPN conducts tracking at 160 frames per second.
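Both the accuracy and EAO metrics above are built on the bounding-box overlap (intersection over union, IoU) between predicted and ground-truth boxes. The following is a minimal illustrative sketch of per-frame overlap and its average over a sequence; the official VOT toolkit additionally handles tracker re-initializations and burn-in frames, which are omitted here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (clamped at zero for disjoint boxes).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    """Mean per-frame IoU over a sequence (an accuracy-style measure)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)
```

For example, two unit-area boxes offset by half their width overlap with IoU 1/7, and a sequence average near 0.57 would correspond to the accuracy figure reported above.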


    This model was trained to work with two different feature extractors: AlexNet and ResNet. The feature extractor is followed by a Siamese region proposal network, which performs the object tracking. The Region Proposal Network (RPN) was first proposed in the design of Faster R-CNN for object detection. An RPN extracts precise proposal regions by analyzing foreground-background features and performing bounding-box regression. A Siamese network has two branches that implicitly encode the original patches into another space and then fuse them with a specific tensor to produce a single output. This architecture is typically used to compare the two sets of features produced by the two branches in a common embedding space for contrastive tasks. This model was implemented in PyTorch.
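The core Siamese operation can be illustrated without a deep-learning framework: the template branch's feature map is slid over the search branch's feature map as a correlation kernel, and the peak response indicates the target location. The toy sketch below works on plain 2D lists; in the actual Siamese-RPN this correlation is computed on learned CNN features, with classification and box-regression heads per anchor, so all names here are illustrative.

```python
def cross_correlate(search, template):
    """Slide the template over the search map and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            # Dot product between the template and the search window at (y, x).
            score = sum(
                search[y + dy][x + dx] * template[dy][dx]
                for dy in range(th) for dx in range(tw)
            )
            row.append(score)
        response.append(row)
    return response

def peak(response):
    """(row, col) of the maximum response, i.e. the predicted target position."""
    return max(
        ((y, x) for y in range(len(response)) for x in range(len(response[0]))),
        key=lambda p: response[p[0]][p[1]],
    )
```

A real tracker crops the template once from the first frame and correlates it against each subsequent frame's search region, which is what makes this design fast enough for 160 fps operation.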


    This model was trained on the Youtube-BB dataset for 50 epochs using stochastic gradient descent, with a learning rate decreased in log space from 0.01 to 0.000001 over the course of training. The Youtube-BB dataset includes 100,000 videos annotated once every 30 frames.
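A learning rate that "decreases in log space" means the exponent, not the rate itself, is interpolated linearly between the endpoints. The sketch below shows one plausible 50-epoch schedule from 0.01 down to 0.000001; the exact per-iteration schedule used in training is not specified here, so the function and its parameters are illustrative.

```python
import math

def log_space_lr(epoch, num_epochs=50, lr_start=1e-2, lr_end=1e-6):
    """Learning rate decayed linearly in log10 space across epochs."""
    t = epoch / (num_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    exponent = (1 - t) * math.log10(lr_start) + t * math.log10(lr_end)
    return 10 ** exponent

# One value per epoch: 0.01 at epoch 0 decaying to 0.000001 at epoch 49.
schedule = [log_space_lr(e) for e in range(50)]
```

Compared with a linear decay, this geometric-style schedule spends many epochs at small learning rates, which is common practice when fine-tuning region proposal heads.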


    The model was tested on the VOT2015, VOT2016, and VOT2017 datasets, each containing 60 videos, and on OTB2015 with 100 videos.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size  Accepted Format(s)
    config.json  1 MB          .json
    input        1 GB          .mp4, .mpg


    This model will output the following:

    Filename      Maximum Size  Format
    results.json  2 MB          .json