Overhead Building Detection

Model by Modzy

This model identifies and outlines buildings within an overhead image. It accepts 300 × 300 pixel RGB images in JPEG or PNG format. It outputs a new version of the input image with bounding boxes and pixel masks highlighting each detected building, along with a JSON file containing the same detection information. Typical uses include calculating building footprint areas or identifying buildings damaged by a natural disaster.

    Product Description


    This model was trained on a large dataset of over 280K overhead images drawn from different distributions. It achieves an average precision of 0.70 and an average recall of 0.48 at an Intersection over Union (IoU) threshold of 0.5. The model's strengths include detecting buildings of varied sizes and shapes, transferring well across domains (i.e., to images from pixel distributions other than the training data), and fast inference on both CPU and GPU. It works best on clear overhead imagery without clouds or noise.
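The IoU threshold mentioned above decides whether a predicted box counts as a correct detection. A minimal sketch of that computation for axis-aligned boxes (the function name and box convention are illustrative, not from the model's own code):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 100x100 boxes overlapping by half: intersection 5000, union 15000
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 0.333..., below the 0.5 threshold
```

At evaluation time, a prediction is scored as a true positive only when its IoU with a ground-truth building is at least 0.5.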

    70% Average Precision – A higher precision score indicates that, on average, the majority of the labels the model predicts are correct.

    48% Average Recall – A higher recall score indicates that, on average, the model finds the majority of the true labels it is supposed to find.
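The two metrics above come directly from true-positive, false-positive, and false-negative counts. A minimal sketch, with hypothetical counts chosen to mirror the reported scores:

```python
def precision(tp, fp):
    # Fraction of predicted buildings that match a real building
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of real buildings the model actually found
    return tp / (tp + fn)

# Hypothetical counts: 70 correct detections, 30 false alarms, 76 missed buildings
print(precision(70, 30))  # 0.7
print(recall(70, 76))     # ~0.479
```

High precision with lower recall, as reported here, means the model's detections are usually right but it misses a share of the buildings present.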


    This model is based on the Mask R-CNN deep neural network, a method first published in 2017 by a Facebook AI team. Mask R-CNN is the most recent architecture in the R-CNN family of algorithms; it was chosen for this model because it performs object detection and masking more quickly than the other R-CNN-based architectures. The R-CNN approach to detecting objects differs from single-shot detectors such as YOLO: it first generates candidate bounding boxes, then uses convolutional neural networks and a region proposal network to regress those boxes onto specific objects in the image. This approach achieves higher, more precise detection performance than faster but less accurate single-shot architectures such as YOLO and SSD. The model is implemented in the Keras deep learning framework.


    This model was trained on a large dataset of overhead imagery consisting of 280,741 RGB images. The training and validation images are drawn from OpenStreetMap and span a variety of collection platforms, including nano-satellites, drones, and conventional high-altitude satellites.

    The model was trained in three stages:

    1. The network heads of the model were trained using stochastic gradient descent for 40 epochs with 1,000 steps per epoch, a learning rate of 0.001, and momentum of 0.9.

    2. The rest of the ResNet backbone was trained using stochastic gradient descent for 120 epochs with 1,000 steps per epoch, a learning rate of 0.001, and momentum of 0.9.

    3. All layers were fine-tuned once more by training the entire network with stochastic gradient descent for another 160 epochs with 1,000 steps per epoch, a smaller learning rate of 0.0001, and momentum of 0.9.
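The three stages above can be tallied into a single schedule. This is a sketch of that schedule as data, not the model's actual training code; the stage names are illustrative:

```python
# Three-stage fine-tuning schedule described above (SGD, momentum 0.9 throughout)
STEPS_PER_EPOCH = 1000
MOMENTUM = 0.9

stages = [
    {"layers": "heads",           "epochs": 40,  "learning_rate": 1e-3},
    {"layers": "resnet_backbone", "epochs": 120, "learning_rate": 1e-3},
    {"layers": "all",             "epochs": 160, "learning_rate": 1e-4},
]

total_epochs = sum(stage["epochs"] for stage in stages)
total_steps = total_epochs * STEPS_PER_EPOCH
print(total_epochs, total_steps)  # 320 epochs, 320000 gradient steps in total
```

Training the heads first stabilizes the randomly initialized layers before the pretrained backbone is unfrozen, and the final full-network pass uses a 10x smaller learning rate to avoid destroying the learned features.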


    The performance of the model was tested on a validation dataset of 60,317 images. Given the large training and test set sizes, k-fold cross-validation was unnecessary.


    The input(s) to this model must adhere to the following specifications:

    Filename | Maximum Size | Accepted Format(s)
    -------- | ------------ | ------------------
    image    | 1M           | .jpg, .png


    This model will output the following:

    Filename     | Maximum Size | Format
    ------------ | ------------ | ------
    results.json | 1M           | .json
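The schema of results.json is not documented here, so the structure below is an assumption for illustration only (field names like "detections", "confidence", and "box" may differ in the real output). It shows how a consumer might filter detections by confidence:

```python
import json

# Hypothetical results.json payload; the real field names may differ.
sample = json.loads("""
{
  "detections": [
    {"class": "building", "confidence": 0.91, "box": [12, 34, 120, 150]},
    {"class": "building", "confidence": 0.42, "box": [200, 10, 260, 80]}
  ]
}
""")

# Keep only detections at or above a chosen confidence threshold
THRESHOLD = 0.5
buildings = [d for d in sample["detections"] if d["confidence"] >= THRESHOLD]
print(len(buildings))  # 1
```

Downstream tasks such as footprint-area calculation would iterate over the surviving boxes (or masks) rather than the raw image.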