Video Redaction

Model by Modzy

This model redacts user-specified objects in a video. This capability is useful for protecting the privacy of entities recorded in the video, as well as for censoring sensitive content. The model takes the objects of interest marked in the first frame, tracks them throughout the video, and covers them with a black box in every frame so that they are never visible.

  • Description

    Product Description


    57% Average Accuracy – The average of the accuracies of various classes. Further information here.

    35% Expected Average Overlap (EAO) – The expected average overlap between the predicted bounding box and the object of interest over a window of frames. Further information here.

    97% Robustness – The fraction of correct predictions made by the classifier on the synthetically generated adversarial test dataset. This metric measures the resiliency of the model against adversarial attacks. Further information here.

    The object tracking model was trained on the Youtube-BB dataset, which contains 100,000 videos. It was tested on the datasets for the Visual Object Tracking (VOT) challenges VOT2015, VOT2016, and VOT2017, each containing 60 videos, as well as on OTB2015, which contains 100 videos. The model achieves an average accuracy of 0.57, an average robustness of 0.97, and an Expected Average Overlap of 0.35.
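
    The overlap behind these tracking metrics can be illustrated with a short sketch. It assumes that overlap is measured as intersection over union (IoU), the usual choice in VOT-style benchmarks; the source does not state which measure was used. The function names are hypothetical, and the full EAO computation averages over many sequences and window lengths, which this sketch does not attempt.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes (assumed measure)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero when the boxes are disjoint).
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(predicted, ground_truth):
    """Mean per-frame IoU over one window of frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0
```

    A tracker whose boxes match the ground truth exactly scores 1.0 under this measure, and one whose boxes never touch the object scores 0.0.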


    This model works in two stages. First, the user-specified objects in the first frame are tracked throughout the video. The object tracking model uses one of two feature extractors, AlexNet or ResNet, followed by a Siamese Region Proposal Network (Siamese-RPN), which tracks the objects and returns their per-frame bounding boxes. Second, the objects are redacted by setting all the pixels within their corresponding bounding boxes to black.
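
    The redaction stage can be sketched in a few lines. This is only an illustration under stated assumptions, not the model's actual code: the frame is represented as a plain list of rows of [r, g, b] pixels so the example needs no libraries, the function name is hypothetical, and clamping boxes to the frame edges is an added safeguard that the source does not describe.

```python
def redact_frame(frame, boxes):
    """Set every pixel inside each [x, y, width, height] box to black, in place.

    `frame` is a list of rows; each row is a list of [r, g, b] pixels.
    """
    frame_h = len(frame)
    frame_w = len(frame[0]) if frame_h else 0
    for x, y, w, h in boxes:
        # Clamp the box to the frame (assumption) so a box that drifts past an edge is safe.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(frame_w, x + w), min(frame_h, y + h)
        for row in range(y0, y1):
            for col in range(x0, x1):
                frame[row][col] = [0, 0, 0]
    return frame
```

    In a real pipeline the frame would more likely be an image array, where the same operation is a single slice assignment; the nested loops here only make the per-pixel behavior explicit.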


    The object tracking model was trained on the Youtube-BB dataset for 50 epochs using stochastic gradient descent with a learning rate that decayed in log space from 0.01 to 0.000001 over the course of training. The redaction portion of this model is deterministic and therefore did not require any training.
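
    A learning rate that decays in log space can be sketched as follows. The source gives only the range and the number of epochs, so this example assumes one update per epoch with both endpoints included; the function name and defaults are illustrative.

```python
import math


def log_space_schedule(lr_start=0.01, lr_end=0.000001, epochs=50):
    """Per-epoch learning rates whose base-10 exponents are evenly spaced."""
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    if epochs == 1:
        return [lr_start]
    exp_start = math.log10(lr_start)
    exp_end = math.log10(lr_end)
    # Equal steps in the exponent give a constant ratio between consecutive rates.
    step = (exp_end - exp_start) / (epochs - 1)
    return [10 ** (exp_start + i * step) for i in range(epochs)]
```

    Under these assumptions each epoch multiplies the previous rate by the same factor, so the rate drops by four orders of magnitude over the 50 epochs.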


    The object tracking model was tested on the VOT2015, VOT2016, and VOT2017 datasets, each containing 60 videos. It was also validated using the OTB2015 dataset, which contains 100 videos.


    The inputs to this model must adhere to the following specifications:

    Filename       Maximum Size    Accepted Format(s)
    input          1G              .mpg, .mp4
    config.json    1M              .json

    The input video cannot have a resolution higher than 4096×2160, and at most the first 10,000 frames will be read. The “config.json” file must contain the first-frame bounding boxes of the objects to be tracked and redacted throughout the video. The format should be as follows:

          "boundingBox": [[x1, y1, width1, height1], [x2, y2, width2, height2]]

    All bounding box values are pixel counts. The values “x1” and “y1” correspond to the top left bounding box corner, with “x1” being the number of pixels from the left of the frame, and “y1” being the number of pixels down from the top of the frame. For example, [10, 20, 30, 40] would signify a 30-by-40 (width-by-height) bounding box with its top left corner located 10 pixels to the right and 20 pixels down from the top left of the frame.
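
    As an illustration, the configuration file could be generated and sanity-checked with a small script. The source shows only the "boundingBox" key and its value, so wrapping it in a top-level JSON object is an assumption, as are the function name, the requirement that each box lies inside the frame, and the reading of the resolution limit as separate width and height limits.

```python
import json

MAX_WIDTH, MAX_HEIGHT = 4096, 2160  # resolution limit from the input specification


def build_config(boxes, frame_width, frame_height):
    """Return config.json text for first-frame boxes given as [x, y, width, height]."""
    if frame_width > MAX_WIDTH or frame_height > MAX_HEIGHT:
        raise ValueError("video resolution exceeds 4096x2160")
    for box in boxes:
        x, y, w, h = box
        if x < 0 or y < 0 or w <= 0 or h <= 0:
            raise ValueError(f"invalid box {box}: negative corner or empty size")
        # Assumed check: the box must fit inside the first frame.
        if x + w > frame_width or y + h > frame_height:
            raise ValueError(f"box {box} extends beyond the frame")
    # Assumed layout: a single JSON object holding the "boundingBox" key.
    return json.dumps({"boundingBox": [list(b) for b in boxes]})
```

    For instance, build_config([[10, 20, 30, 40]], 1920, 1080) produces the text for a file that marks a single 30-by-40 box with its top left corner at (10, 20).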


    This model will output the following:

    Filename       Maximum Size    Format
    output.mpg     1G              .mpg

    The output file will be named “output.mpg” and will contain the video with the selected objects tracked and redacted (pixels within the tracked bounding boxes set to black).