This model visually tracks objects in real time. It uses a Siamese region proposal network (Siamese-RPN) trained end-to-end on large datasets, allowing it to track objects across domains. The model accepts video frames or a folder of JPEG images of any size and can track any type of moving object, such as people or vehicles, through consecutive images or video frames. It runs at 160 frames per second and achieves leading performance on well-known tracking benchmarks such as VOT2016 and VOT2017.
57% Average Accuracy – The average of the accuracies of various classes.
35% Expected Average Overlap (EAO) – An estimator of the average overlap a tracker is expected to attain on a large collection of short-term sequences with the same visual properties as the given dataset.
97% Robustness – The fraction of correct predictions made by the classifier on the synthetically generated adversarial test dataset. This metric measures the resiliency of the model against adversarial attacks.
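Both the accuracy and EAO metrics above are built on the overlap (intersection-over-union) between the predicted and ground-truth bounding boxes in each frame. A minimal sketch of that overlap computation is below; the `iou` function name and the `(x1, y1, x2, y2)` box convention are assumptions for illustration, not part of the benchmark tooling:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (clamped to zero if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    # Union = sum of areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Averaging this overlap across the frames of a sequence gives the per-sequence accuracy; EAO then averages it over a large collection of sequences while also accounting for tracking failures.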
The model was tested on the Visual Object Tracking (VOT) challenge datasets VOT2015, VOT2016, and VOT2017, each containing 60 videos, and on OTB2015 with 100 videos. On these datasets the model reached an average accuracy of 0.57, a robustness of 0.97, and an EAO of 0.35, where EAO (expected average overlap) is a measure that accounts for both accuracy and robustness across frames. Siamese-RPN conducts tracking at 160 frames per second.
This model was trained to work with two different feature extractors: AlexNet and ResNet. The feature extractor is followed by a Siamese region proposal network that performs the object tracking. The Region Proposal Network (RPN) was first proposed in the design of Faster R-CNN for object detection; it extracts precise region proposals by classifying foreground versus background and performing bounding-box regression. A Siamese network is a network with two branches that implicitly encodes the original patches into another space and then fuses them with a specific tensor to produce a single output. This architecture is commonly used to compare two sets of features, produced by the two branches, in a common embedding space for contrastive tasks. This model was implemented in PyTorch.
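The two-branch computation described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the released implementation: the `SiamRPNSketch` name, the stand-in backbone layers, the channel counts, and the patch sizes in the comments are all assumptions. It shows the core idea of Siamese-RPN: a shared backbone embeds the template and search patches, and the template features are then used as correlation kernels over the search features to produce per-anchor classification scores and box regressions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiamRPNSketch(nn.Module):
    """Illustrative Siamese-RPN-style tracker head (sizes are assumptions)."""

    def __init__(self, feat_ch=64, anchors=5):
        super().__init__()
        # Shared backbone: a stand-in for the AlexNet/ResNet feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 7, stride=2), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 5, stride=2), nn.ReLU(),
        )
        # RPN heads: the template branch produces correlation kernels,
        # the search branch produces feature maps to correlate against.
        self.cls_kernel = nn.Conv2d(feat_ch, feat_ch * 2 * anchors, 3)
        self.cls_search = nn.Conv2d(feat_ch, feat_ch, 3)
        self.reg_kernel = nn.Conv2d(feat_ch, feat_ch * 4 * anchors, 3)
        self.reg_search = nn.Conv2d(feat_ch, feat_ch, 3)
        self.feat_ch = feat_ch

    def _xcorr(self, search, kernel):
        # Cross-correlation: template features act as convolution kernels
        # over the search-region features (one group per batch element).
        n = search.size(0)
        k = kernel.view(-1, self.feat_ch, kernel.size(2), kernel.size(3))
        out = F.conv2d(search.view(1, -1, *search.shape[2:]), k, groups=n)
        return out.view(n, -1, out.size(2), out.size(3))

    def forward(self, template, search):
        zf = self.backbone(template)  # small exemplar patch of the target
        xf = self.backbone(search)    # larger search region in the next frame
        cls = self._xcorr(self.cls_search(xf), self.cls_kernel(zf))
        reg = self._xcorr(self.reg_search(xf), self.reg_kernel(zf))
        return cls, reg  # per-anchor foreground/background scores, box offsets
```

A usage sketch: with a 63x63 template and a 127x127 search region, `cls` has 2 channels per anchor (foreground/background) and `reg` has 4 (box offsets) at every position of the response map, from which the best-scoring anchor gives the tracked box for that frame.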
This model was trained on the Youtube-BB dataset for 50 epochs using stochastic gradient descent, with a learning rate decreased in log space from 0.01 to 0.000001 over the course of training. The Youtube-BB dataset includes 100,000 videos, annotated once every 30 frames.
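The log-space schedule above can be sketched as follows. The endpoints (0.01 to 1e-6) and the 50-epoch count come from the description; the `log_space_schedule` helper name is an assumption. Spacing the rates evenly in log space is equivalent to multiplying the rate by a constant factor each epoch:

```python
def log_space_schedule(start=1e-2, end=1e-6, epochs=50):
    """Per-epoch learning rates spaced evenly in log space (geometric decay)."""
    ratio = (end / start) ** (1.0 / (epochs - 1))
    return [start * ratio ** i for i in range(epochs)]

lrs = log_space_schedule()  # lrs[0] is 0.01, lrs[-1] decays to 1e-6
```

Each epoch's rate would then be fed to the SGD optimizer before that epoch's updates.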
The model was tested on the VOT2015, VOT2016, and VOT2017 datasets, each containing 60 videos, and on OTB2015 with 100 videos.
The input(s) to this model must adhere to the following specifications:
This model will output the following: