Label-Conditional Synthetic Satellite Imagery

Project from Microsoft AI For Good Lab

Go to Github

Motivation

Generating high-quality synthetic satellite imagery matters because satellite imagery is a crucial type of data for training machine learning models that address global issues such as climate change and biodiversity estimation.

Furthermore, there are many other use cases in established industries such as urban planning, security, agriculture, and insurance. However, high-resolution satellite imagery is infrequently collected and expensive to access, making it a scarce resource.

Moreover, licensing constraints oftentimes prohibit the release of high-resolution satellite images to the public. In contrast, synthetic satellite imagery can be abundant, low-cost, and high-quality at the same time.






Our Solution

We propose a label-conditional synthetic image generation model for creating synthetic satellite imagery datasets.

Given a dataset of real high-resolution imagery and accompanying land cover masks, we show that it is possible to train an upstream class-conditional image generator, use that generator together with the land cover masks to create synthetic imagery, and then train a downstream model on the synthetic imagery and masks that achieves test-set performance similar to a model trained on the real imagery.

Further, we find that incorporating a mixture of real and synthetic imagery acts as a data augmentation method, producing better models than using only real imagery.


Features and experiments

Our basic pipeline consists of an upstream task, synthetic data generation, and a downstream task that tests the usability of the synthetic data. In principle, the downstream task could be any machine learning model or task that uses satellite images; in this work, we choose segmentation as our main downstream task.









Label conditional synthetic satellite imagery generation

We adapt a conditional GAN with SPADE normalization to the synthetic satellite imagery task and optimize it to generate better images, prepare different datasets for training the downstream models, and evaluate the downstream segmentation models trained on each dataset.




We trained SPADE on real 3-channel very-high-resolution (VHR) satellite imagery, with land cover and building segmentation masks as conditioning inputs. Because the baseline model tends to produce nearly identical synthetic images given the same masks, we add an additional loss term during training to increase output diversity.
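The diversity term can be sketched as a mode-seeking regularizer: it rewards the generator for producing outputs that differ when the latent code differs. A minimal numpy sketch follows; the function name and the exact distance choice (mean absolute difference) are illustrative assumptions, not the exact formulation from our paper.

```python
import numpy as np

def diversity_loss(img_a, img_b, z_a, z_b, eps=1e-5):
    """Mode-seeking diversity term (illustrative sketch).

    Given two generator outputs img_a, img_b produced from the same mask
    but different latent codes z_a, z_b, encourage the images to differ
    as much as the latents do. Minimizing the negative ratio maximizes
    image distance per unit of latent distance.
    """
    img_dist = np.mean(np.abs(img_a - img_b))  # distance between outputs
    z_dist = np.mean(np.abs(z_a - z_b))        # distance between latents
    return -img_dist / (z_dist + eps)
```

During training this term would be weighted by the diversity parameter lambda and added to the usual conditional-GAN losses; identical outputs incur zero reward, while diverse outputs push the loss down.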

Our results suggest that increasing diversity leads to more photo-realistic synthetic images and better downstream performance in segmentation. Synthetic satellite images are generated at different levels of diversity for our downstream experiments (please refer to our paper for more details).

Above, each row shows three examples of synthetic images generated by the model from random latent codes for a given class mask. The “real imagery” is shown for reference and is not used at inference time.

Downstream Tasks

To test whether our synthetic satellite imagery can serve as effective data augmentation, we design a series of experiments based on the multi-label land cover segmentation task, where a downstream model classifies each pixel of an image into one of 6 land cover classes.
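The evaluation metric used throughout is mean intersection-over-union (mIoU) over the 6 land cover classes. A self-contained sketch of the computation (class names and count taken from the results table; function names are ours):

```python
import numpy as np

NUM_CLASSES = 6  # water, forest, low vegetation, barren, impervious (other), impervious (road)

def per_class_iou(pred, target, num_classes=NUM_CLASSES):
    """IoU for each class over integer-labeled prediction/target maps.

    Classes absent from both maps get NaN so they do not skew the mean.
    """
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

def mean_iou(pred, target, num_classes=NUM_CLASSES):
    """mIoU: average the per-class IoUs, ignoring absent classes."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))
```

A perfect prediction scores 1.0; the per-class values correspond to the columns of the results table below.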


We first explore the segmentation performance of models trained on synthetic imagery with different degrees of diversity, generated with different values of the diversity weight lambda.

lambda | mIoU   | FID (1) | FID (2)
0      | 0.2894 | 72.29   | 73.31
2      | 0.3417 | 63.07   | 70.72
4      | 0.3827 | 61.70   | 61.38
6      | 0.4059 | 56.60   | 70.98
8      | 0.3572 | 58.09   | 63.46
10     | 0.3234 | 60.48   | 56.37

mIoU measures the performance of the downstream model trained on images generated with diversity weight lambda. FID (1) is calculated on synthetic test images generated without the trained encoder; FID (2) is calculated on synthetic test images generated with the trained encoder.

Synthetic tiles generated with lambda = 6 yield the highest mIoU score of 0.4059. We use this diversity value to generate synthetic tiles for most of the following experiments.


To evaluate the usability of the synthetic images relative to the real ones, we trained four downstream segmentation models on the following datasets: 100% real, containing 100 real 3-channel RGB satellite image tiles; and 100% syn, 200% syn, and 300% syn, containing 100, 200, and 300 synthetic RGB tiles respectively (we generate three synthetic versions of each real tile by changing the random inputs during upstream generation). All synthetic tiles are generated by the upstream model with lambda = 6, and the downstream model randomly crops patches from the real and synthetic tiles for training.
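The random-crop step that feeds the downstream model can be sketched as follows; the patch size and channel-first layout are assumptions for illustration, not values from our paper.

```python
import numpy as np

def random_crop(tile, size, rng=None):
    """Randomly crop a channel-first (C, H, W) tile to (C, size, size).

    Used to sample training patches from full real or synthetic tiles.
    """
    rng = rng if rng is not None else np.random.default_rng()
    _, h, w = tile.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return tile[:, top:top + size, left:left + size]
```

Because crops are re-sampled each epoch, a single tile contributes many distinct training patches, which is part of why even a fixed set of tiles behaves like a larger dataset.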

The mIoU results below show that training on synthetic imagery alone, even in larger quantities, does not improve segmentation performance; in particular, performance on the water class drops significantly.

Training                   | Water  | Forest | Low Vegetation | Barren Land | Impervious (other) | Impervious (road) | Mean
100% real                  | 0.6794 | 0.8386 | 0.7279         | 0.1205      | 0.5302             | 0.2443            | 0.5235
100% synthetic             | 0.4001 | 0.7332 | 0.5642         | 0.0134      | 0.4085             | 0.3161            | 0.4059
200% synthetic             | 0.5322 | 0.6956 | 0.5636         | 0.0125      | 0.3677             | 0.3288            | 0.4167
300% synthetic             | 0.2432 | 0.7402 | 0.5479         | 0.0157      | 0.3316             | 0.2878            | 0.3611
100% synthetic (4-channel) | 0.9100 | 0.7476 | 0.7034         | 0.0177      | 0.4143             | 0.3097            | 0.5171
100% real (4-channel)      | 0.9676 | 0.8532 | 0.8346         | 0.1456      | 0.5665             | 0.5137            | 0.6469

Since the NIR channel carries substantial information about water bodies, including it might improve segmentation performance. We therefore also trained two models on 100% real (4-channel) and 100% synthetic (4-channel) data; these results are shown above, and we refer to our paper for more details.
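Moving from 3-channel RGB to 4-channel RGB+NIR requires widening the first convolution of the segmentation network. One common approach, shown here as a numpy sketch (the function name and the choice of seeding the NIR weights from the red channel are our illustrative assumptions), is to inflate a pretrained 3-channel kernel to 4 channels so initial activations stay in a sensible range:

```python
import numpy as np

def inflate_first_conv(weight_rgb, nir_init="red"):
    """Extend a first-layer conv weight (out, 3, kh, kw) to (out, 4, kh, kw)
    so an RGB model accepts RGB+NIR input.

    The new NIR slot is seeded from the red channel ("red") or from the
    mean of the RGB channels ("mean") -- both common heuristics.
    """
    out_c, in_c, kh, kw = weight_rgb.shape
    assert in_c == 3, "expected a 3-channel (RGB) first layer"
    if nir_init == "red":
        nir = weight_rgb[:, 0:1]                          # copy red-channel weights
    else:
        nir = weight_rgb.mean(axis=1, keepdims=True)      # average over RGB
    return np.concatenate([weight_rgb, nir], axis=1)
```

In a deep-learning framework the same operation would be applied to the first convolution's weight tensor before fine-tuning on 4-channel tiles.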

Even with only 10% of the amount of data, the segmentation model trained on 4-channel synthetic images reaches performance comparable to that of the model trained on 3-channel real images.


Furthermore, we combine real and synthetic images in different mix proportions to explore empirically whether, and by how much, including synthetic satellite imagery is a better augmentation strategy for training downstream segmentation models. As shown in the figure below, the model trained on a dataset containing 50% synthetic images reaches a higher mIoU (0.5834), and hence better segmentation performance, than the model trained only on real images (0.5235).
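Building a mixed training set for a given synthetic fraction can be sketched as below; the function name and the sampling-without-replacement choice are illustrative assumptions for how such a mixture might be assembled.

```python
import random

def build_mixed_dataset(real_tiles, synthetic_tiles, synthetic_fraction,
                        n_total, seed=0):
    """Assemble a training set of n_total tiles with the requested
    fraction of synthetic tiles, sampled without replacement."""
    rng = random.Random(seed)
    n_syn = round(n_total * synthetic_fraction)
    n_real = n_total - n_syn
    mixed = rng.sample(real_tiles, n_real) + rng.sample(synthetic_tiles, n_syn)
    rng.shuffle(mixed)  # interleave real and synthetic tiles
    return mixed
```

Sweeping `synthetic_fraction` from 0 to 1 reproduces the kind of mix-proportion experiment described above, with 0.5 corresponding to the 50% synthetic setting.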

Our team

We are graduate students from the Harvard John A. Paulson School of Engineering and Applied Sciences. We thank Sarah Rathnam and Weiwei Pan for their help in coordination and communication.

We are proud to work with the Microsoft AI for Good Lab on this project. We thank Caleb Robinson, Simone Fobi Nsutezo, and Anthony Ortiz of the Microsoft AI for Good Research Lab for their insightful advice. Their expertise in artificial intelligence and commitment to using technology for social good make them a perfect partner.

Sherry(Xinran) Tang: xinran_tang@g.harvard.edu

SM Student in Applied Computation

Mengyuan Li: mengyuan_li@g.harvard.edu

SM Student in Applied Computation

Chelsea(Zixi) Chen: zixichen@g.harvard.edu

SM Student in Applied Computation

Van Anh Le: vananhle@g.harvard.edu

SM Student in Applied Computation

Varshini Reddy: varshinibogolu@g.harvard.edu

SM Student in Applied Computation
