Object Position Tracking

Model by Modzy

This model geodetically tracks objects in any video containing standards-compliant geodetic metadata. It uses a Siamese region proposal network (Siamese-RPN) trained end-to-end on large datasets so that it can track objects in any domain. The model accepts MPEG and MP4 video files packaged according to the MISB ST 0601 standard and outputs the latitude and longitude coordinates of the tracked object in each frame of the video, producing both a JSON-formatted set of tracks and an ESRI shapefile packaged as a zip archive.

This model can be used to track moving objects, such as people and vehicles, across consecutive images or video frames.

    Product Description


    57% Average Accuracy – The average of the accuracies across the various object classes.

    35% Expected Average Overlap (EAO) – An estimate of the average overlap a tracker is expected to attain on a large collection of short-term sequences with the same visual properties as the given dataset.

    97% Average Robustness – Measures how reliably the model tracks a specific object, on average, over a sequence of consecutive frames.

    The model was tested on datasets from the Visual Object Tracking (VOT) challenges (VOT2015, VOT2016, and VOT2017, each containing 60 videos) and on OTB2015, which contains 100 videos. On these datasets the model reached an average accuracy of 0.57, a robustness of 0.97, and an EAO of 0.35, where EAO (expected average overlap) is a measure that takes both accuracy and robustness into account across frames. Siamese-RPN can perform tracking at 160 frames per second.
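    As an illustration of how the accuracy figure above is computed, here is a minimal sketch (not the official VOT toolkit) that measures per-frame overlap as intersection-over-union (IoU) and averages it over the frames where the tracker reported a box; the function names are ours:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # horizontal overlap
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # vertical overlap
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def vot_accuracy(pred_boxes, gt_boxes):
    """Average overlap over the frames where the tracker produced a box
    (frames with a tracking failure, represented as None, are skipped)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes) if p is not None]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0
```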


    This model was trained to work with two different feature extractors: AlexNet and ResNet. The feature extractor is followed by a Siamese region proposal network that performs the object tracking. The Region Proposal Network (RPN) was first proposed in the design of Faster R-CNN for object detection; it extracts precise proposal regions by analyzing foreground-background features and performing bounding box regression. A Siamese network is a network with two branches that implicitly encodes the original patches into another space and then fuses them with a specific tensor to produce a single output. This architecture is typically used to compare two sets of features, produced by the two branches, in a common embedding space for contrastive tasks. This model was implemented in PyTorch.
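    The Siamese idea described above can be sketched in PyTorch as a shared backbone whose template embedding is cross-correlated with the search-region embedding. This is a minimal illustration under our own simplified architecture, not the actual Siamese-RPN implementation (which adds classification and box-regression branches on top of the correlation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCorrelation(nn.Module):
    """Minimal sketch: one weight-shared backbone embeds both the template
    patch and the search region, then the template embedding is used as a
    correlation kernel over the search embedding to produce a response map."""
    def __init__(self, channels=16):
        super().__init__()
        # Shared (weight-tied) feature extractor for both branches.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, template, search):
        z = self.backbone(template)   # (1, C, Hz, Wz)
        x = self.backbone(search)     # (1, C, Hx, Wx)
        # Cross-correlate: the template features act as the convolution kernel.
        return F.conv2d(x, z)         # (1, 1, Hx-Hz+1, Wx-Wz+1)

model = SiameseCorrelation()
template = torch.randn(1, 3, 7, 7)    # exemplar patch of the target
search = torch.randn(1, 3, 31, 31)    # larger search region
response = model(template, search)
print(response.shape)  # torch.Size([1, 1, 25, 25])
```

    The peak of the response map indicates where the template matches best inside the search region; Siamese-RPN refines this with anchor-based proposals rather than taking the raw peak.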


    This model was trained on the Youtube-BB dataset for 50 epochs using stochastic gradient descent, with a learning rate that decreased during training in log space from 0.01 to 0.000001. The Youtube-BB dataset includes 100,000 videos annotated once every 30 frames.
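    A learning rate that decreases in log space corresponds to an exponential decay schedule. A small sketch, where the epoch count and endpoints are taken from this section and the helper name is ours:

```python
def log_space_lr(epoch, epochs=50, lr_start=1e-2, lr_end=1e-6):
    """Learning rate decayed linearly in log space (i.e. exponentially)
    from lr_start at epoch 0 to lr_end at the final epoch."""
    t = epoch / (epochs - 1)                  # progress in [0, 1]
    return lr_start * (lr_end / lr_start) ** t

# The endpoints of the schedule:
print(log_space_lr(0))   # 0.01
print(log_space_lr(49))  # ~1e-06
```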


    The model was tested on the VOT2015, VOT2016, and VOT2017 datasets, each containing 60 videos, and on OTB2015 with 100 videos.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size  Accepted Format(s)
    config.json  1M            .json
    input        1G            .mp4, .mpg
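    These limits can be checked client-side before submission. A hypothetical helper (the function name and error messages are ours; only the size and format limits come from the table above):

```python
import os

# Limits from the input specification table.
MAX_SIZES = {"config.json": 1 * 1024**2, "input": 1 * 1024**3}
VIDEO_EXTS = {".mp4", ".mpg"}

def validate_inputs(config_path, video_path):
    """Return a list of violations of the model's input spec (empty if valid)."""
    errors = []
    if os.path.getsize(config_path) > MAX_SIZES["config.json"]:
        errors.append("config.json exceeds 1M")
    ext = os.path.splitext(video_path)[1].lower()
    if ext not in VIDEO_EXTS:
        errors.append(f"unsupported video format: {ext}")
    if os.path.getsize(video_path) > MAX_SIZES["input"]:
        errors.append("input video exceeds 1G")
    return errors
```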


    This model will output the following:

    Filename      Maximum Size  Format
    results.json  2M            .json
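    One way to consume the per-frame latitude/longitude tracks downstream is to convert them into GeoJSON line strings for use in GIS tools. The sketch below assumes a hypothetical track layout, `{track_id: [(lat, lon), ...]}`, since the exact schema of results.json is not documented here:

```python
def tracks_to_geojson(tracks):
    """Convert per-frame (lat, lon) tracks into a GeoJSON FeatureCollection.
    The input layout is hypothetical: {track_id: [(lat, lon), ...]}."""
    features = []
    for track_id, points in tracks.items():
        features.append({
            "type": "Feature",
            "properties": {"track_id": track_id},
            "geometry": {
                "type": "LineString",
                # GeoJSON orders coordinates as (lon, lat).
                "coordinates": [[lon, lat] for lat, lon in points],
            },
        })
    return {"type": "FeatureCollection", "features": features}
```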