Label-Conditional Synthetic Satellite Imagery
A project with the Microsoft AI for Good Lab
Motivation
Generating high-quality synthetic satellite imagery matters because satellite imagery is a crucial type of data for training machine learning models that address global issues such as climate change and biodiversity estimation. There are also a variety of use cases in established industries such as urban planning, security, agriculture, and insurance. However, high-resolution satellite imagery is infrequently collected and expensive to access, making it a scarce resource. Moreover, licensing constraints often prohibit releasing high-resolution satellite images to the public. In contrast, synthetic satellite imagery can be abundant, low-cost, and high-quality at the same time.
Our Solution
We propose a label-conditional synthetic image generation model for creating synthetic satellite imagery datasets.
Given a dataset of real high-resolution imagery and accompanying land cover masks, we show that it is possible to train an upstream class-conditional image generator, use that generator to create synthetic imagery from the land cover masks, and then train a downstream model on the synthetic imagery and land cover masks that achieves test set performance similar to a model trained on the real imagery.
Further, we find that mixing real and synthetic imagery acts as a data augmentation method, producing better models than using real imagery alone.
Features and experiments
Our basic pipeline consists of an upstream task (training the generator), synthetic data generation, and a downstream task that tests the usability of the synthetic data. In general, the downstream task could be any machine learning model or task that uses satellite images; in this work, we choose segmentation as our main downstream task (a pipeline sketch follows the list below).
- We adapt a conditional GAN model with spatially-adaptive normalization (SPADE) to overcome the lack of diversity in the synthetic output.
- We use synthetic satellite images as a substitute for, or augmentation of, real datasets to build models of comparable quality.
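As a rough illustration of this three-stage pipeline, the sketch below wires together placeholder models: `SpadeGenerator` and `SegmentationModel` are stand-ins for the real architectures, and the tensor shapes, class count, and latent size are our assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 6   # number of land cover classes (assumed)
LATENT_DIM = 256  # generator latent size (assumed)

class SpadeGenerator(nn.Module):
    """Placeholder for the SPADE generator: class mask + latent -> RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(NUM_CLASSES + LATENT_DIM, 3, kernel_size=3, padding=1)

    def forward(self, mask_onehot, z):
        # Broadcast the latent code spatially and condition on the mask.
        z_map = z[:, :, None, None].expand(-1, -1, *mask_onehot.shape[2:])
        return torch.tanh(self.net(torch.cat([mask_onehot, z_map], dim=1)))

class SegmentationModel(nn.Module):
    """Placeholder downstream segmenter: RGB image -> per-pixel class logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, NUM_CLASSES, kernel_size=3, padding=1)

    def forward(self, image):
        return self.net(image)

# Stage 1 (upstream): the generator would be trained on real image/mask pairs.
generator = SpadeGenerator()

# Stage 2: generate synthetic imagery from existing land cover masks.
masks = torch.randint(0, NUM_CLASSES, (4, 64, 64))  # fake masks for the demo
masks_onehot = nn.functional.one_hot(masks, NUM_CLASSES).permute(0, 3, 1, 2).float()
z = torch.randn(4, LATENT_DIM)
synthetic_images = generator(masks_onehot, z).detach()

# Stage 3 (downstream): train a segmenter on (synthetic image, mask) pairs.
segmenter = SegmentationModel()
logits = segmenter(synthetic_images)
loss = nn.functional.cross_entropy(logits, masks)
loss.backward()
print(f"downstream loss on synthetic batch: {loss.item():.4f}")
```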
Label-conditional synthetic satellite imagery generation
We adapt a conditional GAN with SPADE to the synthetic satellite imagery task and optimize it to generate better images, prepare different datasets for training the downstream models, and evaluate the performance of the downstream segmentation models trained on each dataset.
We trained SPADE using real 3-channel very-high-resolution (VHR) satellite imagery with land cover and building segmentation masks as inputs. Because the baseline model tends to produce nearly identical synthetic images given the same mask, we add a loss term during training to increase output diversity.
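The paper specifies the exact form of this diversity term; as a hedged illustration, the sketch below implements a mode-seeking style regularizer that rewards output variation in proportion to latent variation, scaled by a weight `lambda_div` (the lambda swept in the table further down). The function name and shapes are ours, not the repository's.

```python
import torch

def diversity_loss(img1, img2, z1, z2, eps=1e-5):
    """Mode-seeking style regularizer (an assumed form, not necessarily the
    paper's exact loss): encourage two images generated from different latents
    but the same mask to differ in proportion to their latent distance.
    Minimizing the negative ratio pushes the generator toward diverse outputs.
    """
    img_dist = torch.mean(torch.abs(img1 - img2), dim=[1, 2, 3])  # per-sample L1
    z_dist = torch.mean(torch.abs(z1 - z2), dim=1)
    return -torch.mean(img_dist / (z_dist + eps))

# Hypothetical usage inside the generator update:
#   z1, z2 = torch.randn(B, D), torch.randn(B, D)
#   img1, img2 = G(mask, z1), G(mask, z2)
#   loss_G = adversarial_loss + lambda_div * diversity_loss(img1, img2, z1, z2)
```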
Our results suggest that increasing diversity leads to more photo-realistic synthetic images and better downstream segmentation performance. We generate synthetic satellite images at different levels of diversity for our downstream experiments (please refer to our paper for more details).
Above, each row shows three examples of synthetic images generated from random latent representations for a given class mask. The real imagery is shown for reference but is not used at inference time.
Downstream Tasks
To test whether our synthetic satellite imagery can serve as effective data augmentation, we design a series of experiments around a land cover segmentation task, in which a downstream model classifies each pixel of an image into one of 6 land cover labels.
We first explore the segmentation performance of models trained on synthetic imagery with different degrees of diversity, generated with different values of the diversity weight lambda.
lambda | mIoU | FID (1) | FID (2)
0 | 0.2894 | 72.29 | 73.31
2 | 0.3417 | 63.07 | 70.72
4 | 0.3827 | 61.70 | 61.38
6 | 0.4059 | 56.60 | 70.98
8 | 0.3572 | 58.09 | 63.46
10 | 0.3234 | 60.48 | 56.37
mIoU reflects the performance of the downstream model trained on images generated with diversity weight lambda. FID (1) is computed with synthetic test images generated without the trained encoder; FID (2) is computed with synthetic test images generated using the trained encoder.
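For reference, FID compares the statistics of Inception features extracted from real and synthetic images. A minimal sketch of the computation from precomputed feature matrices (the feature extractor is omitted; `real_feats` and `syn_feats` are assumed to be N x 2048 Inception activations):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, syn_feats):
    """FID between two sets of feature vectors (rows = samples):
    FID = ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^(1/2))
    """
    mu_r, mu_s = real_feats.mean(axis=0), syn_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(syn_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_s
    return diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean)
```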
Synthetic tiles generated with lambda = 6 yield the highest mIoU score, 0.4059. We use this diversity value to generate the synthetic tiles for most of the following experiments, since it gives consistently better performance.
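Throughout, mIoU is the mean over classes of the per-class intersection-over-union. A small sketch of how the per-class IoU values reported in the tables here could be computed (the function names are ours):

```python
import numpy as np

def per_class_iou(pred, target, num_classes=6):
    """Per-class IoU for integer label maps `pred` and `target` of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

def mean_iou(pred, target, num_classes=6):
    # nanmean skips classes absent from both prediction and ground truth.
    return np.nanmean(per_class_iou(pred, target, num_classes))
```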
To evaluate the usability of the synthetic images relative to the real ones, we trained four downstream segmentation models on the following datasets: 100% real, which contains 100 real satellite image tiles in a 3-channel RGB version; and 100% syn, 200% syn, and 300% syn, which contain 100, 200, and 300 synthetic RGB tiles respectively (we generate 3 different synthetic versions of each real tile by changing the random latent input during upstream generation). All synthetic tiles are generated with the upstream model using lambda = 6, and the downstream model randomly crops patches from the real and synthetic tiles during training.
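A hedged sketch of the random-crop training setup described above; the class name, patch size, and crops-per-tile count are our assumptions, not the repository's actual data pipeline.

```python
import torch
from torch.utils.data import Dataset

class RandomCropTiles(Dataset):
    """Serve random patches from (image tile, mask tile) pairs.

    `tiles` is a list of (image, mask) tensors: image (C, H, W) float,
    mask (H, W) long. The patch size of 256 is an assumption.
    """
    def __init__(self, tiles, patch_size=256, crops_per_tile=16):
        self.tiles = tiles
        self.patch = patch_size
        self.crops_per_tile = crops_per_tile

    def __len__(self):
        return len(self.tiles) * self.crops_per_tile

    def __getitem__(self, idx):
        image, mask = self.tiles[idx % len(self.tiles)]
        _, h, w = image.shape
        top = torch.randint(0, h - self.patch + 1, (1,)).item()
        left = torch.randint(0, w - self.patch + 1, (1,)).item()
        return (image[:, top:top + self.patch, left:left + self.patch],
                mask[top:top + self.patch, left:left + self.patch])
```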
The mIoU results below show that training on synthetic imagery alone, even in larger quantities, does not improve segmentation performance; in particular, performance on the water class drops significantly.
Training | Water | Forest | Low Vegetation | Barren Land | Impervious (other) | Impervious (road) | Mean
100% real | 0.6794 | 0.8386 | 0.7279 | 0.1205 | 0.5302 | 0.2443 | 0.5235
100% synthetic | 0.4001 | 0.7332 | 0.5642 | 0.0134 | 0.4085 | 0.3161 | 0.4059
200% synthetic | 0.5322 | 0.6956 | 0.5636 | 0.0125 | 0.3677 | 0.3288 | 0.4167
300% synthetic | 0.2432 | 0.7402 | 0.5479 | 0.0157 | 0.3316 | 0.2878 | 0.3611
100% synthetic (4-channel) | 0.9100 | 0.7476 | 0.7034 | 0.0177 | 0.4143 | 0.3097 | 0.5171
100% real (4-channel) | 0.9676 | 0.8532 | 0.8346 | 0.1456 | 0.5665 | 0.5137 | 0.6469
Since the NIR channel contains substantial information about water bodies, including it might improve segmentation performance, so we also trained two models on 100% real (4-channel) and 100% synthetic (4-channel) data. The results are shown above; please refer to our paper for more details. Even with only 10% of the data, the segmentation model trained on 4-channel synthetic images reaches performance comparable to the model trained on 3-channel real images.
Furthermore, we combined real and synthetic images in different proportions to explore empirically whether, and by how much, including synthetic satellite imagery is a better augmentation strategy for training downstream segmentation models. As shown in the figure below, the model trained on a dataset containing 50% synthetic images reaches a higher mIoU (0.5834), and hence better segmentation performance, than the model trained only on real images (0.5235).
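A hedged sketch of how such a fixed-proportion mix could be assembled; the helper name and API are ours, not the repository's.

```python
import random

def mix_tiles(real_tiles, synthetic_tiles, synthetic_fraction=0.5, total=100, seed=0):
    """Build a training set with a fixed fraction of synthetic tiles.

    `real_tiles` and `synthetic_tiles` are lists of (image, mask) pairs;
    a fraction of 0.5 with total=100 draws 50 tiles from each pool.
    """
    rng = random.Random(seed)
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    mixed = rng.sample(real_tiles, n_real) + rng.sample(synthetic_tiles, n_syn)
    rng.shuffle(mixed)
    return mixed
```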
Our team
We are graduate students from the Harvard John A. Paulson School of Engineering and Applied Sciences. We thank Sarah Rathnam and Weiwei Pan for their help with coordination and communication.
We are proud to be working with the Microsoft AI for Good Lab on this project, and we thank Caleb Robinson, Simone Fobi Nsutezo, and Anthony Ortiz of the lab for their insightful advice. Their expertise in artificial intelligence and commitment to using technology for social good make them a perfect partner for us.
Sherry (Xinran) Tang: xinran_tang@g.harvard.edu
SM Student in Applied Computation
Mengyuan Li: mengyuan_li@g.harvard.edu
SM Student in Applied Computation
Chelsea (Zixi) Chen: zixichen@g.harvard.edu
SM Student in Applied Computation
Van Anh Le: vananhle@g.harvard.edu
SM Student in Applied Computation
Varshini Reddy: varshinibogolu@g.harvard.edu
SM Student in Applied Computation