Pedestrian and Vehicle Activity Detection

Model by Modzy

Identifying and tracking people and vehicles in video is valuable for forensic and real-time alerting applications. However, thousands of hours of recorded content are time-consuming and difficult to sift through for potentially threatening entities. To address this, researchers are developing activity detectors: AI models that add activity classification to object detection and tracking. This is an innately complex task, not only because each object is unique, but also because of the interactions that can occur between multiple objects (e.g., a person opening a car door). This model tackles this complexity by detecting and tracking people and vehicles in videos while classifying their individual and interactive activities (Person only, Vehicle only, and Person-Vehicle interactions).

    Product Description


    48.4% Partial Area Under (DET) Curve

    79.8% Probability of Missed Detections

    39.2% Time-Based Probability of Missed Detections

    This model was trained and evaluated on the publicly available VIRAT and Multiview Extended Video with Activities (MEVA) datasets. The VIRAT dataset contains about 8.5 hours of HD video with annotations for 12 event types in 11 different outdoor scenes. The MEVA dataset contains more than 250 hours of ground camera video, with additional resources such as UAV video, camera models, and a subset of 12.5 hours of annotated data. The model was assessed according to the NIST competition criteria: the normalized partial Area Under the DET Curve (nAUDC) was the primary scoring metric, alongside the mean probability of missed detections (mPmiss) evaluated at time-based and instance-based false alarm rates of 0.15. This model achieves an nAUDC score of 0.48407 and mPmiss scores of 0.39152 (time-based) and 0.7979 (instance-based).

    Partial Area Under DET Curve: quantifies the relationship between false alarm rate and probability of missed detections over a given confidence interval. Further information here.

    Probability of Missed Detections: the ratio of the number of missed detections to the number of true activity instances, at the presence confidence threshold that results in a specific rate of false alarm. Further information here.

    Time-Based Probability of Missed Detections: the ratio of the number of missed detections to the number of true activity instances, at the presence confidence threshold that results in a specific time-based false alarm value (the fraction of true non-activity time during which the system falsely reports an activity).
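    As a rough illustration of the metrics above, the following Python sketch computes a missed-detection probability, reads P_miss off a DET curve at a target false alarm rate, and integrates a normalized partial area under that curve. The function names, the linear interpolation between operating points, the requirement that the curve start at a false alarm rate of 0, and the sample curve are all assumptions made for illustration. The NIST scoring software remains the reference for the official numbers.

```python
from typing import List, Tuple

# One DET operating point per confidence threshold: (false_alarm_rate, p_miss).
DetPoint = Tuple[float, float]


def p_miss(num_missed: int, num_true: int) -> float:
    """Probability of missed detection: missed instances / true instances."""
    if num_true <= 0:
        raise ValueError("num_true must be positive")
    return num_missed / num_true


def p_miss_at(points: List[DetPoint], target_fa: float) -> float:
    """Linearly interpolate P_miss at a target false alarm rate (assumption)."""
    pts = sorted(points)
    if not pts[0][0] <= target_fa <= pts[-1][0]:
        raise ValueError("target_fa is outside the range covered by the curve")
    for (fa0, pm0), (fa1, pm1) in zip(pts, pts[1:]):
        if fa0 <= target_fa <= fa1:
            if fa1 == fa0:
                return min(pm0, pm1)
            t = (target_fa - fa0) / (fa1 - fa0)
            return pm0 + t * (pm1 - pm0)
    return pts[-1][1]  # single-point curve


def normalized_partial_audc(points: List[DetPoint], fa_max: float) -> float:
    """Trapezoidal area under the DET curve on [0, fa_max], divided by fa_max."""
    if fa_max <= 0:
        raise ValueError("fa_max must be positive")
    pts = sorted(points)
    if pts[0][0] > 0:
        raise ValueError("this sketch expects the curve to start at FA = 0")
    # Keep the points below fa_max and close the region with an interpolated point.
    clipped = [p for p in pts if p[0] < fa_max]
    clipped.append((fa_max, p_miss_at(pts, fa_max)))
    area = 0.0
    for (fa0, pm0), (fa1, pm1) in zip(clipped, clipped[1:]):
        area += (fa1 - fa0) * (pm0 + pm1) / 2.0
    return area / fa_max


# Made-up DET curve, not taken from this model's evaluation.
sample_curve = [(0.0, 1.0), (0.1, 0.6), (0.2, 0.4), (0.4, 0.2)]
sample_pmiss = p_miss_at(sample_curve, 0.15)
sample_naudc = normalized_partial_audc(sample_curve, 0.2)
```

    Lower values are better for all three quantities, since each measures how often true activities go undetected.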


    This model detects, tracks, and classifies person and vehicle activities in video captured by stationary cameras. The model was designed as a modular pipeline and won the NIST Activities in Extended Videos (ActEV) 2019 challenge. The model was trained and validated on videos taken from the VIRAT and MEVA publicly available datasets. It takes an MP4 video file as input and returns a JSON file containing bounding boxes and tracklets (a fragment of the track followed by a moving object) of detected persons and vehicles, their activity labels, and corresponding confidence scores. The activity types can be grouped into three buckets: Person only (enters, exits, etc.), Vehicle only (U-turn, right/left turn, etc.), and Person-Vehicle interactions (person opens vehicle door, person opens vehicle trunk, etc.). More information on the class labels can be found on the ActEV challenge description page.


    This model was trained modularly: each stage of the pipeline is a model that was trained individually, and each model’s outputs are piped to the next stage. The first stage, object detection and tracking, uses the Mask R-CNN architecture as its backbone. This model is actively being improved, and more details can be found on either GitHub or in the following publication. The next stage, proposal generation, analyzes the person and vehicle trajectories obtained by the first stage to spatially and temporally localize candidates for the three activity buckets described above. The proposals are then fed to the final stage, feature extraction and activity classification. Feature extraction uses an I3D architecture, and a bi-directional LSTM performs temporal activity classification. Both models were fine-tuned on the VIRAT and MEVA datasets.
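    The modular layout described above can be sketched as three interchangeable stages composed in sequence. This is only a structural sketch: the class names, the function signatures, and the placeholder stage bodies are hypothetical and stand in for the real Mask R-CNN, proposal generation, and I3D plus bi-directional LSTM components, whose actual interfaces are not documented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels


@dataclass
class Track:
    object_id: int
    object_type: str        # "Person" or "Vehicle"
    boxes: Dict[int, Box]   # frame number -> bounding box


@dataclass
class Proposal:
    track_ids: List[int]
    start_frame: int
    end_frame: int


@dataclass
class Activity:
    label: str
    confidence: float
    proposal: Proposal


def run_pipeline(
    video_path: str,
    detect_and_track: Callable[[str], List[Track]],
    generate_proposals: Callable[[List[Track]], List[Proposal]],
    classify: Callable[[Proposal, List[Track]], Activity],
) -> List[Activity]:
    """Pipe each stage's output into the next stage."""
    tracks = detect_and_track(video_path)
    proposals = generate_proposals(tracks)
    return [classify(p, tracks) for p in proposals]


# Placeholder stages (hypothetical) so the composition can be exercised.
def fake_detector(video_path: str) -> List[Track]:
    return [Track(47, "Person", {1449: (832, 402, 72, 145)}),
            Track(5, "Vehicle", {1969: (245, 232, 75, 56)})]


def one_proposal_per_track(tracks: List[Track]) -> List[Proposal]:
    return [Proposal([t.object_id], min(t.boxes), max(t.boxes)) for t in tracks]


def fake_classifier(proposal: Proposal, tracks: List[Track]) -> Activity:
    return Activity("Opening", 0.5, proposal)


activities = run_pipeline("input.mp4", fake_detector,
                          one_proposal_per_track, fake_classifier)
```

    Because each stage is just a callable, a stage can be retrained or swapped without touching the others, which is the practical benefit of training the pipeline modularly.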


    This model was evaluated on various validation sets provided by NIST, which consisted of videos taken from a subset of both the VIRAT and MEVA datasets. The scoring software used by NIST can be found here.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.mp4    200M            .mp4
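
    A client could check a file against these limits before submitting it. The sketch below is illustrative only: the function name is hypothetical, and reading "200M" as 200 MiB (200 * 1024 * 1024 bytes) is an assumption, since the exact unit is not stated.

```python
import os

# Assumption: "200M" is interpreted as 200 MiB.
MAX_INPUT_BYTES = 200 * 1024 * 1024


def check_input(path: str, max_bytes: int = MAX_INPUT_BYTES) -> None:
    """Raise ValueError if the file is not an .mp4 or exceeds the size limit."""
    extension = os.path.splitext(path)[1].lower()
    if extension != ".mp4":
        raise ValueError(f"expected an .mp4 file, got {extension or 'no extension'}")
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"file is {size} bytes, limit is {max_bytes} bytes")
```

    Calling check_input("input.mp4") returns silently when the file passes both checks. Note that this only inspects the file extension and size, not the container contents.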


    This model will output the following:

    Filename        Maximum Size    Format
    results.json    5M              .json

    The output JSON contains entries for each detected activity. Each listed activity will have a class label (“activity”), confidence score (“presenceConf”), and a list of objects that are involved in the activity (“objects”). This model detects the following activities:

    Activity                     Participants
    Closing                      (P, V) or (P)
    Closing_trunk                (P, V)
    Entering                     (P, V) or (P)
    Exiting                      (P, V) or (P)
    Loading                      (P, V)
    Open_Trunk                   (P, V)
    Opening                      (P, V) or (P)
    Transport_HeavyCarry         (P, V)
    Unloading                    (P, V)
    Vehicle_turning_left         (V)
    Vehicle_turning_right        (V)
    Vehicle_u_turn               (V)
    Pull                         (P)
    Riding                       (P)
    Talking                      (P)
    activity_carrying            (P)
    specialized_talking_phone    (P)
    specialized_texting_phone    (P)

    where “(P)” means person only, “(V)” means vehicle only, and “(P, V)” means person-vehicle interaction. Each object involved in an activity has an attribute named “objectType”, which contains either “Person” or “Vehicle”, and a set of bounding boxes indexed by their associated frame number. The bounding boxes are stored on the object under the key “localization”. Each bounding box lists the box’s top-left “x” and “y” coordinates, as well as the box width, “w”, and height, “h”, in pixels.

    The following example demonstrates the JSON structure:

    {
      "activities": [
        {
          "activity": "Opening",
          "presenceConf": 0.9600149719488054,
          "activityID": 6,
          "alertFrame": 1786,
          "proposal_id": 55,
          "localization": {
            "1449": 1,
            "1786": 0
          },
          "objects": [
            {
              "objectID": 47,
              "objectType": "Person",
              "localization": {
                "1449": {
                  "boundingBox": {
                    "x": 832,
                    "y": 402,
                    "w": 72,
                    "h": 145
                  }
                }
              }
            },
            {
              "objectID": 5,
              "objectType": "Vehicle",
              "localization": {
                "1969": {
                  "boundingBox": {
                    "x": 245,
                    "y": 232,
                    "w": 75,
                    "h": 56
                  }
                }
              }
            }
          ]
        }
      ]
    }
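
    To show how the output might be consumed, the following Python sketch filters activities by confidence and converts each bounding box from (x, y, w, h) to corner coordinates. It assumes the nesting implied by the field descriptions (activities, then objects, then localization keyed by frame number, then boundingBox). The helper names and the 0.5 threshold are hypothetical, and the embedded sample reuses the values from the example.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

# Compact copy of the example values; the nesting is an assumption.
SAMPLE_RESULTS = """
{"activities": [{
  "activity": "Opening", "presenceConf": 0.9600149719488054,
  "activityID": 6, "alertFrame": 1786, "proposal_id": 55,
  "localization": {"1449": 1, "1786": 0},
  "objects": [
    {"objectID": 47, "objectType": "Person",
     "localization": {"1449": {"boundingBox": {"x": 832, "y": 402, "w": 72, "h": 145}}}},
    {"objectID": 5, "objectType": "Vehicle",
     "localization": {"1969": {"boundingBox": {"x": 245, "y": 232, "w": 75, "h": 56}}}}
  ]}]}
"""

Corners = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def confident_activities(results: Dict[str, Any], min_conf: float) -> List[Dict[str, Any]]:
    """Keep activities whose presenceConf is at least min_conf."""
    return [a for a in results.get("activities", [])
            if a.get("presenceConf", 0.0) >= min_conf]


def iter_boxes(activity: Dict[str, Any]) -> Iterator[Tuple[str, int, int, Corners]]:
    """Yield (objectType, objectID, frame, corners) for every box in an activity."""
    for obj in activity.get("objects", []):
        for frame, entry in obj.get("localization", {}).items():
            box = entry["boundingBox"]
            left, top = box["x"], box["y"]
            corners = (left, top, left + box["w"], top + box["h"])
            yield obj["objectType"], obj["objectID"], int(frame), corners


results = json.loads(SAMPLE_RESULTS)
kept = confident_activities(results, min_conf=0.5)
boxes = [b for activity in kept for b in iter_boxes(activity)]
```

    For a real run, replace the embedded string with the contents of results.json, for example via json.load on the opened file.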