General Object Detection in Overhead Imagery

Model by Open Source

The ability to quickly and accurately identify objects in aerial imagery is useful in a variety of scenarios, such as urban planning, crop monitoring, and traffic surveillance. This model is trained to detect 60 different types of objects (including buses, vehicle lots, buildings, oil tankers, etc.) using one of the largest and highest-quality public datasets of annotated high-resolution satellite imagery. Covering a total area of over 45,000 square kilometers, this dataset contains many variations of each of the 60 object classes, resulting in a robust model. This model is open source and was developed by the Modzy data science team.

    Product Description


    23.2% 60-class mAP

    The mean of the Average Precision scores across all 60 classes.

    This model was trained and validated using the public xView2 dataset, which consists of high-resolution satellite imagery annotated with building locations and damage scores before and after natural disasters. This model achieves a 60-class mean average precision (mAP) score of 0.2319, calculated using the methodology outlined by SIMRDWN. An Intersection over Union (IoU) threshold of 0.5 was used for most classes, and a threshold of 0.25 was used for comparably smaller objects, such as vehicles. The model performs best on electro-optical satellite imagery with a ground sample distance of 0.3 meters. For fast inference, this model should have access to at least one GPU.
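The class-dependent IoU matching described above can be sketched as follows. This is an illustrative helper, not the model's actual evaluation code, and the set of "small" classes shown is an assumed subset for demonstration.

```python
# Sketch of IoU computation and the class-dependent matching threshold
# described above (0.5 for most classes, 0.25 for smaller objects).

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

# Illustrative subset of the smaller classes; the full list used during
# evaluation is not documented here.
SMALL_CLASSES = {"Small Car", "Pickup Truck", "Bus"}

def is_match(pred_box, gt_box, class_name):
    """Match a prediction to a ground-truth box with a relaxed
    threshold for smaller object classes."""
    threshold = 0.25 if class_name in SMALL_CLASSES else 0.5
    return iou(pred_box, gt_box) >= threshold
```

A detection with IoU around 0.33 against its ground-truth box would therefore count as a match for a small class like Small Car but not for a larger class like Truck.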



    This model detects 60 classes of objects within overhead electro-optical (EO) satellite imagery. It utilizes an adapted version of YOLO called YOLT2, provided by the open source SIMRDWN framework. YOLT2 is a pipeline tailored to satellite imagery, in which larger convolutional filters are replaced by smaller 3×3 filters, refining the model's ability to detect small objects from a distance. This model was trained and validated using the public xView2 dataset, and accepts a TIFF, PNG, or JPEG image as its input. It returns a JSON file containing detected bounding boxes, each with its object class name and confidence score.


    The training set consists of 80% of the original images and annotations, randomly sampled without replacement, from the public xView2 dataset. The images were then chipped to fit the YOLT network’s window size of 416 x 416 pixels and processed at their native resolution. Image chips that contained no objects were discarded.
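The chipping step described above can be sketched roughly as below, assuming images are loaded as NumPy arrays of shape (H, W, 3) and annotations are given as pixel-coordinate object centers; the function name and signature are illustrative, not the actual preprocessing code.

```python
# Minimal sketch of chipping a large satellite image into the YOLT
# network's 416x416 window size and discarding chips with no objects.
import numpy as np

CHIP = 416  # YOLT network window size in pixels

def chip_image(image, annotations):
    """Split an image into non-overlapping 416x416 chips.

    annotations: list of (x, y) object centers in pixel coordinates.
    Returns a list of (chip_array, chip_row, chip_col) tuples for chips
    that contain at least one object; empty chips are discarded.
    """
    h, w = image.shape[:2]
    chips = []
    for row in range(0, h - CHIP + 1, CHIP):
        for col in range(0, w - CHIP + 1, CHIP):
            # Keep the chip only if at least one object falls inside it.
            if any(row <= y < row + CHIP and col <= x < col + CHIP
                   for x, y in annotations):
                chips.append((image[row:row + CHIP, col:col + CHIP],
                              row // CHIP, col // CHIP))
    return chips
```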


    The input(s) to this model must adhere to the following specifications:

    | Filename | Maximum Size | Accepted Format(s)         |
    |----------|--------------|----------------------------|
    | input    | 100M         | .png, .jpg, .jpeg, .tiff   |

    The input file should be a 3-channel electro-optical satellite image.
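A simple pre-flight check against the input specification above (accepted extensions and the 100M size cap) might look like the following; this is an illustrative helper, not part of the model package.

```python
# Validate an input file against the spec above: accepted extension
# and the 100M maximum size. Channel count is not checked here, since
# that requires decoding the image.
import os

ACCEPTED = {".png", ".jpg", ".jpeg", ".tiff"}
MAX_BYTES = 100 * 1024 * 1024  # 100M limit from the spec

def validate_input(path):
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("input exceeds 100M size limit")
    return True
```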


    This model will output the following:

    | Filename     | Maximum Size | Format |
    |--------------|--------------|--------|
    | results.json | 1M           | .json  |

    The output file (results.json) will contain the detected bounding boxes. Each bounding box will contain the corresponding class name, confidence score, and the top-left/bottom-right x,y coordinates defining the box. This model can detect the following object classes:

    Fixed-wing Aircraft, Small Aircraft, Cargo Plane, Helicopter, Small Car, Bus, Pickup Truck, Utility Truck, Truck, Cargo Truck, Truck with Box, Truck Tractor, Trailer, Truck with Flatbed, Truck with Liquid, Crane Truck, Railway Vehicle, Passenger Car, Cargo Car, Flat Car, Tank Car, Locomotive, Maritime Vessel, Motorboat, Sailboat, Tugboat, Barge, Fishing Vessel, Ferry, Yacht, Container Ship, Oil Tanker, Engineering Vehicle, Tower Crane, Container Crane, Reach Stacker, Straddle Carrier, Mobile Crane, Dump Truck, Haul Truck, Scraper or Tractor, Front Loader or Bulldozer, Excavator, Cement Mixer, Ground Grader, Hut or Tent, Shed, Building, Aircraft Hangar, Damaged Building, Facility, Construction Site, Vehicle Lot, Helipad, Storage Tank, Shipping Container Lot, Shipping Container, Pylon, Tower, Passenger Vehicle
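Consuming results.json might look like the sketch below. The exact output schema is not documented here, so the field names ("detections", "class", "confidence", "box") and the [x_min, y_min, x_max, y_max] box layout are assumptions for illustration only.

```python
# Hedged sketch of filtering detections from the model's JSON output;
# the schema below is assumed, not taken from the model documentation.
import json

def load_detections(path, min_confidence=0.5):
    """Return (class_name, confidence, box) tuples above a confidence
    threshold, assuming boxes are [x_min, y_min, x_max, y_max]."""
    with open(path) as f:
        results = json.load(f)
    kept = []
    for det in results.get("detections", []):
        if det["confidence"] >= min_confidence:
            kept.append((det["class"], det["confidence"], det["box"]))
    return kept
```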