Korean J. Remote Sens. 2024; 40(5): 551-568
Published online: October 31, 2024
https://doi.org/10.7780/kjrs.2024.40.5.1.11
© Korean Society of Remote Sensing
Donghyeon Lee¹, Jiyong Kim², Yongil Kim³*
¹Master's Student, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
²Researcher, Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
³Professor, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
Correspondence to: Yongil Kim
E-mail: yik@snu.ac.kr
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Extracting building information from Very-High-Resolution (VHR) satellite images is critical for urban mapping and monitoring. Traditional manual annotation methods are labor-intensive and costly, making automated solutions highly desirable. The Segment Anything Model (SAM), a foundation model trained mostly on natural images, has recently shown high performance on diverse segmentation tasks. However, due to differences in perspective and the average size of objects in the images, SAM exhibits lower performance when extracting buildings from satellite imagery. These limitations, which derive from differences in image domains, can be addressed by fine-tuning the model with satellite images and preprocessing the input images. However, hyperparameters such as learning rate, batch size, and optimizer type deeply affect the performance of the fine-tuned model, so in-depth investigation of these hyperparameters is critical for model adaptation. To identify the optimal hyperparameter configuration, we conducted extensive experiments with combinations of hyperparameter settings using Korea Multi-Purpose Satellite (KOMPSAT) images. Additionally, various upscaling methods and object-by-object preprocessing techniques were compared and evaluated, leading to the proposal of an effective preprocessing approach. With the optimal combination of the AdamW optimizer, object-by-object cropping, and 100-pixel buffering, an F1 Score of 0.862, an Intersection over Union (IoU) of 0.761, and a mean IoU (mIoU) of 0.705 were achieved. The proposed hyperparameter optimization method underscores the effectiveness of fine-tuning SAM for accurate building extraction from VHR satellite imagery, thereby enabling more reliable data interpretation and decision-making in automated remote sensing applications.
Keywords Building extraction, Segment anything model, Hyperparameter optimization, KOMPSAT
Extracting accurate spatial information on buildings is critical for urban planning, disaster monitoring, agricultural production prediction, and population estimation. Furthermore, the growing availability of very-high-resolution satellite imagery has opened up new opportunities for extracting accurate building information. Accordingly, semantic segmentation (Xu et al., 2018), which involves annotating an image at the pixel level, is considered one of the most important subtasks in deriving building spatial information from optical satellite imagery.
Manual annotation of buildings requires an excessive amount of human intervention, which limits its practicality and scalability (Zheng et al., 2023). Moreover, the complex scenes that appear in satellite images, including the diversity of building types, shapes, sizes, and colors, greatly limit automated building extraction. Furthermore, buildings are usually placed close to other buildings or roads, which makes it hard for machines to distinguish the boundaries of different objects. Given the ability of Deep Neural Networks (DNNs) to incorporate both shallow and deep features into segmentation, deep learning methods for automated building extraction have been widely investigated.
Convolutional Neural Networks (CNNs), which add convolution layers to DNNs to handle two-dimensional image data, have demonstrated exceptional ability in building segmentation (Krizhevsky et al., 2017). Fully Convolutional Networks (FCNs), an extension of CNNs, were subsequently proposed to handle arbitrary-sized images (Long et al., 2016). However, FCNs generate only coarse building predictions, and two approaches were proposed to overcome this problem. The first uses U-shaped encoder-decoder structures, such as U-Net (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2015), which construct decoders symmetrical to their encoders: the encoder captures features from the image and represents them as high-dimensional features, and the decoder restores fine-resolution output from both high-dimensional and low-dimensional features via skip connections. The second approach constructs multiscale networks, such as HRNet (Sun et al., 2019), which preserve high-resolution features by repeatedly fusing multiscale data.
However, CNN-based models suffer from drawbacks such as a high computational burden in pixel-level classification (Guo et al., 2018). Furthermore, CNN-based models are limited in modeling global information. In contrast, transformer architectures that utilize attention mechanisms (Vaswani et al., 2017) offer distinct advantages in capturing global information. Dosovitskiy et al. (2020) proposed the Vision Transformer (ViT) to adapt the self-attention mechanism to computer vision, and several researchers have since applied ViT-based models to semantic segmentation. For instance, Wang et al. (2022b) employed a transformer architecture to develop a decoder, and Wang et al. (2022a) implemented a multipath network that extracts global context using ViT and spatial context through a CNN.
Unlike CNN-based models, transformer-based models require extensive datasets for training, which are often not readily available in the remote sensing domain. Consequently, transformer-based models are often pretrained and then fine-tuned for specific applications. These pretrained models, referred to as foundation models, are trained on large datasets comprising a substantial number of images to enhance their generalization capabilities. The Segment Anything Model (SAM) has increasingly found application as a large-scale visual model, and it exhibits cutting-edge zero-shot capabilities in diverse settings by exploiting prompts (Kirillov et al., 2023). SAM is made up of three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the SAM structure. Each part of SAM was pretrained using the SA-1B dataset, which contains more than 1 billion masks (Kirillov et al., 2023). However, only a small portion of the dataset consists of satellite images, which limits SAM's ability to handle their unique geometric properties.
The unique perspective of satellite images introduces geometric differences that limit the straightforward application of State-of-the-Art (SOTA) models to this type of data (Verde et al., 2018; Chen et al., 2024a). Additionally, since satellite images capture large areas, buildings often appear small within the images, and SAM is known to have a bias regarding the size of the objects it extracts (Ren et al., 2024). Preprocessing the input images before applying SAM is therefore expected to improve extraction performance. Together, these characteristics related to perspective and object scale necessitate hyperparameter optimization and image preprocessing to achieve better performance.
Various researchers have attempted to improve the performance of SAM on satellite images. Chen et al. (2024a) proposed RSPrompter, a prompt-learning method for adapting suitable prompts to remote sensing images. Toward fully automated SAM utilization, Khatua et al. (2024) applied the You Only Look Once (YOLOv8) object detection algorithm to generate prompts. However, these methods do not enhance the mask generation capacity itself, as SAM remains unmodified. Zhou et al. (2024) proposed two enhancements and introduced a fine-tuned version of SAM named MeSAM. First, they added adapter layers between the transformer layers in SAM's image encoder to preserve fine spatial information in remote sensing images. Second, they restructured and fine-tuned the mask decoder with an additional dataset to accommodate more tokens. Nonetheless, adding new structures further compromises the already limited usability of SAM, which stems from its large image encoder. Regarding the general usage of SAM, Ke et al. (2023) identified irregular patterns that are occasionally generated and added a high-quality output token to address the low-resolution problem. The occurrence of this phenomenon can be exacerbated by the choice of hyperparameters. Therefore, methods for optimizing the training of SAM on satellite images should focus on identifying hyperparameter combinations that facilitate effective training while mitigating these issues. In this study, we first aim to determine the optimal hyperparameters for training SAM on Korea Multi-Purpose Satellite (KOMPSAT) images.
Huang et al. (2024) integrated multiscale segmentation with SAM to address the challenge of segmenting complex farmland parcels in high-resolution satellite imagery. To effectively extract parcels of varying sizes, they generated pre-segmentation results at multiple scales and used these results as prompts for SAM. Mazurowski et al. (2023) applied SAM to medical images containing objects of various sizes, confirming that SAM's performance is proportional to the average object size in the images. However, their study focused solely on analyzing SAM's applicability and did not propose any methods for enhancing its performance.
Based on a comprehensive analysis of previous research, we identified key challenges in training related to differences in image perspective, stability during the training process, and the varying sizes of target objects. Our objective is to determine the optimal hyperparameter combination for training SAM's mask decoder on satellite images to address these issues; during training, the image encoder and prompt encoder remain frozen. In this study, SAM was trained on a specific satellite image dataset, namely KOMPSAT. In summary, the objectives of this study are to optimize SAM's training hyperparameters and to propose a cropping and buffering method for KOMPSAT images.
SAM is a foundation model developed for general semantic segmentation tasks (Kirillov et al., 2023). It was trained on the SA-1B dataset, which was created specifically for SAM and contains over one billion masks. Since SAM is designed to be a robust model applicable to various types of images, the SA-1B dataset primarily consists of natural images, with only a small portion of satellite images. However, natural images differ geometrically from satellite images, leading to SAM's lower performance in building extraction tasks.
SAM is built from three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the structure of SAM. Both the image encoder and the mask decoder are made up of vision transformer layers. The image encoder contains over 636 million parameters and creates image embeddings for the mask decoder. The mask decoder is lightweight compared to the image encoder, containing 4 million parameters. For the same image, the embeddings generated by the image encoder can be reused for predicting masks of different objects. In contrast, the prompt encoder and mask decoder cannot handle batched prompts and must be executed separately for each set of prompts.
The prompt is an additional input that specifies the object to be segmented. Sparse prompts such as points, boxes, and text, and dense prompts such as masks, can be used with SAM when available. The prompt encoder represents point and box prompts using positional encoding (Tancik et al., 2020). Multiple types of prompts can be used simultaneously if they indicate the same object within the image. Nonetheless, bounding box prompts are known to give the most precise annotation for SAM in building extraction (Ren et al., 2024). The features from the image encoder and prompt encoder are then combined by the mask decoder to produce the final mask and a prediction score.
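As a concrete illustration, the sketch below passes a single bounding box prompt to SAM through the official segment-anything package; the checkpoint path, image file, and box coordinates are placeholders, not values from this study.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The image embedding is computed once and can be reused for every prompt.
image = cv2.cvtColor(cv2.imread("tile.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A bounding box prompt in XYXY pixel coordinates selects one building.
box = np.array([120, 340, 210, 420])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
# masks: (1, H, W) boolean array; scores: SAM's predicted mask quality
```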
Since SAM requires a prompt to segment a specific object, we preprocessed the building pixel coordinates in the label data to create a bounding box prompt for each building segment. We first computed the maximum and minimum building pixel coordinates to obtain the exact bounding box. To facilitate the generalized application of SAM, we then applied a buffer to the bounding boxes when creating prompt inputs, for two reasons. First, in most practical applications of SAM, exact bounding box coordinates will not be available. Second, in this dataset, the label data have smaller dimensions than the buildings shown in the image. Considering the spatial resolution of the images and these two reasons, a 10-pixel buffer was added when producing each bounding box prompt. Fig. 2 shows the preprocessing stage of creating a bounding box.
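A minimal sketch of this prompt-generation step, assuming each building is available as a binary mask (the array name building_mask and the helper function itself are illustrative):

```python
import numpy as np

def box_prompt_from_mask(building_mask: np.ndarray, buffer: int = 10) -> np.ndarray:
    """Build an XYXY bounding box prompt from a binary building mask,
    expanded by a fixed pixel buffer and clipped to the image extent."""
    h, w = building_mask.shape
    ys, xs = np.nonzero(building_mask)
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    return np.array([
        max(x_min - buffer, 0),
        max(y_min - buffer, 0),
        min(x_max + buffer, w - 1),
        min(y_max + buffer, h - 1),
    ])
```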
Ren et al. (2024) assessed SAM on various satellite datasets, including Solar, 38-Cloud, DeepGlobe Roads, and SpaceNet 2. Additionally, they evaluated SAM's performance on the DigitalGlobe Building and Inria Building datasets to focus on building extraction. To analyze the impact of object scale on performance, they upscaled the input images by factors of 2, 4, and 8 and concluded that SAM's performance can depend significantly on object scale. We applied a similar approach, which we refer to as the kinetic cropping strategy, to conduct an in-depth analysis of the relationship between performance and scale. The kinetic cropping strategy crops the image one-to-one for each bounding box prompt and uses the crop as input. Given the input image and a bounding box prompt, the crop area is determined by the chosen number of buffer pixels. We tested three cases, with buffers of 1, 50, and 100 pixels. Bilinear interpolation was used for all resizing. Fig. 3 illustrates the cropping and buffering preprocessing applied to the images.
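The sketch below shows one possible implementation of this step, under the assumptions that each crop is resized to SAM's 1,024 × 1,024 input with bilinear interpolation and that the box prompt is remapped into crop coordinates:

```python
import numpy as np
import cv2

def kinetic_crop(image: np.ndarray, box: np.ndarray, buffer: int,
                 out_size: int = 1024):
    """Crop one-to-one around a box prompt with a pixel buffer, resize the
    crop with bilinear interpolation, and rescale the prompt to match."""
    h, w = image.shape[:2]
    x0 = max(int(box[0]) - buffer, 0)
    y0 = max(int(box[1]) - buffer, 0)
    x1 = min(int(box[2]) + buffer, w)
    y1 = min(int(box[3]) + buffer, h)
    crop = cv2.resize(image[y0:y1, x0:x1], (out_size, out_size),
                      interpolation=cv2.INTER_LINEAR)
    sx, sy = out_size / (x1 - x0), out_size / (y1 - y0)
    new_box = np.array([(box[0] - x0) * sx, (box[1] - y0) * sy,
                        (box[2] - x0) * sx, (box[3] - y0) * sy])
    return crop, new_box
```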
We applied the adaptation method, a widely used approach to improve model performance in deep learning (Gururangan et al., 2020). During training, only the parameters of the mask decoder were left trainable, while the other two components of SAM's architecture were frozen. Four hyperparameters (optimizer type, learning rate, batch size, and learning rate scheduler) were tested to identify the optimal hyperparameter combination for building extraction with SAM.
For optimizers, Adaptive Moment Estimation (Adam; Kingma and Ba, 2014), Adam with decoupled Weight Decay (AdamW; Loshchilov and Hutter, 2017), and Evolved Sign Momentum (Lion; Chen et al., 2024b) were tested. Adam and AdamW are optimization algorithms that use adaptive learning rates, but they differ in how they apply weight decay. In Adam, weight decay is applied directly during the update along with the learning rate. In AdamW, the parameter update is performed first, followed by the application of weight decay. This decoupling allows AdamW to achieve more effective regularization. The Lion optimizer does not average the gradients but instead tracks the sign of the gradients to achieve faster convergence. Instead of applying weight decay directly, it uses a technique that mitigates overfitting by maintaining a more stable update trajectory. Lion is believed to be especially advantageous for large-scale models.
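In code, this adaptation setup amounts to freezing the two encoders and attaching one of the three optimizers to the mask decoder. A minimal sketch follows, assuming a sam object loaded as in the earlier snippet; the Lion import assumes the third-party lion-pytorch package, and the learning rate shown is one of the candidates tested later.

```python
import torch

# Freeze the image encoder and prompt encoder; train only the mask decoder.
for module in (sam.image_encoder, sam.prompt_encoder):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(),
                              lr=5e-8, weight_decay=0.01)
# Alternatives compared in this study:
# optimizer = torch.optim.Adam(sam.mask_decoder.parameters(),
#                              lr=5e-8, weight_decay=0.01)
# from lion_pytorch import Lion      # third-party package (assumption)
# optimizer = Lion(sam.mask_decoder.parameters(), lr=5e-8)  # default settings
```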
The learning rate controls how much the model’s parameters are updated in each iteration. A higher learning rate can help the model avoid getting trapped in a local minimum and move toward the global minimum, but it may also cause convergence difficulty. Conversely, a lower learning rate can lead to convergence at local minima, increase the risk of overfitting, and slow down the convergence process. Batch size similarly impacts the training dynamics. A smaller batch size allows the model to learn unique cases better but may also increase the risk of overfitting. Therefore, selecting the appropriate learning rate and batch size is crucial for optimal performance. This is especially important for fine-tuning foundation models, where improper tuning can degrade pre-trained features and lead to lower performance.
KOMPSAT series observation data enable independent Earth observation in Korea, and various studies have been conducted to utilize data from the KOMPSAT series effectively (Acharya et al., 2016; Kim and Kim, 2023). The KOMPSAT-3/3A sister satellites are used for VHR optical observation and are appropriate for building extraction. Both satellites use the same optical sensor, the only difference being the added infrared capability of KOMPSAT-3A. Moreover, their orbital altitudes differ, creating differences in Ground Sample Distance (GSD). KOMPSAT-3 orbits 625 km above sea level and captures images at 0.70 m GSD, whereas KOMPSAT-3A operates at an altitude of 528 km with a GSD of 0.55 m (Committee on Earth Observation Satellites, 2024). Table 1 summarizes the specifications of the KOMPSAT-3/3A satellites.
Table 1 Specification of KOMPSAT-3/3A satellites (Committee on Earth Observation Satellites, 2024)
Parameter | KOMPSAT-3 | KOMPSAT-3A |
---|---|---|
Launch date | May 18, 2012 | March 25, 2015 |
Ground sample distance | 0.70 m | 0.55 m |
Altitude | 625 km | 528 km |
Inclination | 98.13° | 97.513° |
Orbital period | 98.5 minutes | 95.2 minutes |
Number of revolutions | 14.6 revolutions/day | 15.1 revolutions/day |
Repeated ground track | 28 days / 423 revolutions | 28 days / 409 revolutions |
In this study, the AI-Hub Satellite Image Building Boundary Detection Dataset version 1.0, which is built from KOMPSAT-3/3A images, was used. The dataset comprises images taken over four cities (Los Angeles in the USA, Shanghai in China, Wolfsburg in Germany, and New Cairo in Egypt) and captures a variety of building forms by country and continent in both raster and polygon formats. It provides building coordinates in both the longitude-latitude coordinate system and image pixel coordinates. The type of building is expressed by color in the label image; however, we did not use this information in this study. The dataset features over 150,000 buildings in 1,238 tiles of officially designated training data and 50,000 buildings in 159 tiles of validation data. Each tile is 1,024 × 1,024 pixels at 0.55–0.70 m resolution. Each tile was used whole, without cropping, since SAM's standard input size is 1,024 × 1,024. We used the training data for training and validation, and the validation data for testing. Fig. 4 shows an example of an original image and label image from the dataset.
To identify the optimal hyperparameter combination for SAM, we conducted 27 experimental cases for each scenario, with and without a learning rate scheduler. These experiments varied the optimizer type, learning rate, and batch size, as these hyperparameters directly influence the training dynamics. The remaining training conditions, such as the loss function and the number of epochs, were kept constant, since they either define the static objective of training or do not significantly affect the training process. Each optimizer was tested with three learning rates (5 × 10⁻⁸, 1 × 10⁻⁷, and 2 × 10⁻⁷) and three batch sizes (4, 8, and 12), resulting in nine combinations per optimizer. The maximum batch size was determined by the GPU used, an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory; a batch size of 12 was the largest for which the GPU could hold the whole model and the embeddings. Smaller batch sizes of 1 or 2 were avoided due to the risk of overfitting. The learning rate candidates were chosen empirically based on several experimental trials. The test cases are listed in Table 2. When the learning rate scheduler was used, the StepLR scheduler reduced the learning rate by half every 2 epochs. When applying the Adam optimizer, an L2 regularization weight decay of 0.01 was chosen to prevent overfitting by adding a penalty term to the loss function. The same weight decay of 0.01 was applied to AdamW, while the default settings were used for Lion.
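The nine cases per optimizer can be enumerated as a simple grid, with the StepLR scheduler configured exactly as described (halving every 2 epochs). The snippet below is a sketch of this experimental loop, not the full training code; the sam object and the per-case training routine are assumed to exist.

```python
import itertools
import torch

learning_rates = [5e-8, 1e-7, 2e-7]
batch_sizes = [4, 8, 12]

# Case 1..9: batch size varies across rows, learning rate across columns (Table 2).
cases = {i + 1: (bs, lr)
         for i, (bs, lr) in enumerate(itertools.product(batch_sizes, learning_rates))}

for case_no, (batch_size, lr) in cases.items():
    optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(),
                                  lr=lr, weight_decay=0.01)
    # Halve the learning rate every 2 epochs when the scheduler is enabled.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    # ... train for 10 epochs, calling scheduler.step() once per epoch ...
```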
Table 2 Test case numbers for combinations of learning rate and batch size (optimizers: Adam, AdamW, Lion; scheduler: StepLR)
Batch size | Learning rate 5 × 10⁻⁸ | Learning rate 1 × 10⁻⁷ | Learning rate 2 × 10⁻⁷
---|---|---|---
4 | Case 1 | Case 2 | Case 3
8 | Case 4 | Case 5 | Case 6
12 | Case 7 | Case 8 | Case 9
The case number applies identically for every optimizer tested.
Other hyperparameters, such as the loss function and the number of epochs, were fixed across experiments in order to focus the analysis on the dynamic hyperparameters that influence the training process. Binary Cross Entropy (BCE) with logit loss was used to compute the loss and update the parameters in the training stage. It combines a sigmoid layer and the BCE loss in a single operation. BCE loss is the most widely used loss function for semantic segmentation, and BCE with logit loss computes it with better numerical stability by applying the log-sum-exp trick. The prediction probability is computed by the sigmoid layer and lies between 0 and 1, and a threshold of 0.5 was used to determine whether each pixel belongs to a building. The unreduced loss can be defined as:

$$L = \{l_1, \ldots, l_N\}^\top, \quad l_i = -\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$
where $y_i$ and $p_i$ denote the label and predicted probability of pixel $i$, respectively, and $N$ is the number of pixels in the minibatch.
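A brief sketch of this loss computation and thresholding in PyTorch, where logits is a placeholder standing in for the mask decoder's raw per-pixel output upsampled to label resolution:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # sigmoid + BCE fused for numerical stability

# Placeholder tensors standing in for decoder output and the single-building label.
logits = torch.randn(1, 1024, 1024, requires_grad=True)
target = torch.randint(0, 2, (1, 1024, 1024)).float()

loss = criterion(logits, target)
loss.backward()  # gradients flow only into the unfrozen mask decoder

# At inference, binarize the predicted probabilities at the 0.5 threshold.
prediction = torch.sigmoid(logits) > 0.5
```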
Using the Case 9 setting with each optimizer, we trained the SAM model for 100 epochs to observe the trend in the loss value. As shown in Fig. 5, the training loss reaches its minimum before epoch 10 and does not decrease further; in some instances, after reaching this minimum, the loss remains static or increases slightly. Based on this observation, 10 epochs are sufficient, and increasing the number of epochs does not improve model performance. Therefore, 10 epochs were used consistently throughout all experiments.
As mentioned earlier, SAM requires a single specific object to be segmented, so each SAM output contains a single object for the whole image. To match this format, we created single-object label images from the ground truth data and used them to measure the loss. Fig. 6 shows an example of the label image preprocessed separately for each building segment. Furthermore, an erosion kernel was applied to avoid collisions between building masks, as the white boundary in the original label image appears larger than the actual building boundary (see Fig. 6), leading to overlaps among the building masks.
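A sketch of this label preprocessing, assuming each building carries a unique integer id in the label array; the 3 × 3 erosion kernel size is an assumption, as it is not stated above.

```python
import numpy as np
import cv2

def single_building_label(label: np.ndarray, building_id: int) -> np.ndarray:
    """Isolate one building from an instance-labeled image and erode its
    mask slightly so that neighboring building masks do not collide."""
    mask = (label == building_id).astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)  # kernel size is an assumption
    return cv2.erode(mask, kernel, iterations=1)
```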
F1 Score, Intersection over Union (IoU), and mean IoU (mIoU) were used to measure the performance of the trained models, along with Precision, Recall, and Accuracy. These metrics are widely used in semantic segmentation tasks and are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1\;Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) denote the number of pixels classified as true positive, true negative, false positive, and false negative, respectively. Building and background pixels are treated as positive and negative, respectively. IoU was measured on the whole-patch prediction, with every building in the image merged, whereas mIoU was computed for each building segment prediction and then averaged. We determined the best hyperparameter set for SAM by comparing these metrics.
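The sketch below computes these pixel metrics together with the two IoU variants, assuming boolean prediction and ground-truth arrays and a list of per-building (prediction, label) pairs:

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-level metrics; building pixels are positive, background negative."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),  # whole-patch IoU (all buildings merged)
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def mean_iou(per_building_pairs) -> float:
    """mIoU: IoU computed per building segment, then averaged."""
    return float(np.mean([pixel_metrics(p, g)["iou"] for p, g in per_building_pairs]))
```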
To examine the training stability and convergence speed, we compared the loss value during training. Fig. 7 shows the mean epoch loss value of all hyperparameter setting cases for each optimizer type used. The epoch numbers are denoted from 0 to 9, and for each case, the mean epoch loss is represented by a solid line when the learning rate scheduler is not used, and a dotted line when the scheduler is applied. The same case setting is represented by the same color.
Using Adam or Lion as the optimizer without the learning rate scheduler caused a significantly unstable training process, with the loss sometimes rising above its initial value. Cases with AdamW did not show this problem, and the training loss in all cases converged to a lower value. The final convergence loss also differed: with AdamW, all cases converged to values between 0.0004 and 0.0007, lower than any case with the other optimizers. The minimum loss for Adam and Lion was 0.0011, in both instances at Case 7 without the learning rate scheduler.
Using a learning rate scheduler had a crucial effect on training stability, especially for Adam and Lion. Without a scheduler, the loss diverged more steeply, and Cases 2, 3, 6, and 9 exhibited the strongest tendency; this trend was more pronounced in cases with a larger learning rate and smaller batch size, a commonality among the cases mentioned above. Still, all cases initially succeeded in decreasing the loss to a minimum of around 0.0012. However, when the scheduler was used, some cases failed to converge to the minimum value. Therefore, model performance can vary regardless of whether the training loss converges.
When using the AdamW optimizer, the training process was largely unaffected by changes in learning rate and batch size. The only difference observed was in the convergence speed: with a larger learning rate and a smaller batch size (which yields more iterations per epoch), convergence was faster. Furthermore, when a scheduler was used, the difference in trend diminished. Adam and Lion showed a similar trend with changes in learning rate and batch size. A larger batch size and a smaller learning rate made training more stable, while the opposite combination introduced more variability and instability during training. The batch size had the greater impact, likely because it determines the number of iterations.
To analyze the performance of the trained models, we tested both the final-epoch model and the model with the lowest mean epoch loss on the test dataset. Both quantitative and qualitative results were evaluated to determine the best hyperparameter combination.
Quantitative comparisons were conducted, as shown in Fig. 8. For each optimizer, the metric scores of the best-performing cases are presented in Table 3, with the highest score in each column highlighted in bold. From Fig. 8, it can be observed that all trained cases show better F1 Scores than the original SAM, marked with a red dotted line.
Table 3 Quantitative metric score results, with the largest value in each column bolded
Optimizer | Scheduler | Case no. | F1 Score | IoU | mIoU | Precision | Recall | Accuracy
---|---|---|---|---|---|---|---|---
Original SAM | – | – | 0.72763 | 0.58943 | 0.65720 | 0.60019 | 0.97398 | 0.89239
Adam | O | 9 | 0.75148 | 0.62326 | 0.74637 | 0.64869 | 0.93911 | 0.90998
Adam | X | 8 | 0.75644 | 0.62366 | 0.74782 | 0.64801 | 0.94688 | 0.91008
AdamW | O | 6 | 0.77668 | **0.65535** | **0.86478** | **0.66284** | 0.97814 | **0.91588**
AdamW | X | 6 | **0.77835** | 0.65148 | 0.84112 | 0.65961 | **0.98136** | 0.91581
Lion | O | 9 | 0.75164 | 0.62356 | 0.74719 | 0.64842 | 0.94005 | 0.91008
Lion | X | 8 | 0.75630 | 0.62345 | 0.74776 | 0.64789 | 0.94674 | 0.90998
All the best metrics were achieved by cases trained with AdamW. With AdamW, the trained model showed an F1 Score approximately 0.05 higher, regardless of whether the scheduler was used. Moreover, when a scheduler was used, the metric values clustered very closely, which can be interpreted as more consistent training. For Adam and Lion, the increase in F1 Score depended on scheduler use: approximately 0.024 with a scheduler and about 0.029 without.
The influence of batch size was consistent across all cases, with larger batch sizes yielding better performance. However, the impact of the learning rate differed between AdamW and the other two optimizers: for AdamW, larger learning rates led to better performance, whereas for Adam and Lion, smaller learning rates produced better results. The best F1 Score for Adam was 0.7564 in Case 8, and the worst was 0.7402 in Case 3, both without a learning rate scheduler. Similarly, for Lion, the best F1 Score was 0.7563 in Case 8, and the worst was 0.7438 in Case 3, also without a scheduler.
The original SAM model generated masks larger than the actual building sizes, resulting in overlapping areas between masks. This creates ambiguous boundary areas between segments, so the exact boundary of each building cannot be determined. The trained models exhibited this characteristic to a lesser extent than the original SAM, allowing them to better distinguish nearby building segments. Fig. 9 shows the mask generation results from all 27 models trained with the learning rate scheduler, and Fig. 10 shows selected mask generation results from predictions on the 159 validation images. The models generated increasingly distinct masks between objects in the order of Adam, Lion, and AdamW, consistent with the higher recall metric observed for AdamW.
As shown in Table 3, Adam and Lion both achieved their highest performance in Case 9 with a scheduler and in Case 8 without one. In contrast, AdamW performed best in Case 6. However, for AdamW, the variation between cases was minimal, suggesting that singling out a specific case as the best based on the evaluation metrics holds little significance.
For Adam and Lion, the impact of the learning rate and batch size was more pronounced. As the learning rate increased, prediction stability decreased, particularly affecting boundary expression and degrading performance: boundaries became irregular, and grid-like dots appeared inside the masks. This phenomenon can be attributed to the original SAM's tendency to predict larger objects, a key difference between natural and satellite images. During training, SAM predicts uniquely shaped building segments over much larger areas, resulting in high losses and large parameter updates when predicting smaller buildings.
As a result, the model degraded significantly and failed to predict smaller buildings, creating grid-patterned gaps in the masks. Additionally, high learning rates caused unstable boundary shapes, producing overly complex masks. A lower learning rate therefore appears preferable for fine-tuning SAM, as it helps mitigate these issues: parameter updates remain small when mask prediction fails sharply, preventing the model from deteriorating, although training then takes longer to converge. With a higher learning rate, trained models performed better at predicting uniquely shaped buildings but failed to predict larger, non-uniquely shaped buildings. As a result, the lower learning rate of 5 × 10⁻⁸ yielded better results.
Cases with larger batch sizes exhibited more accurate mask predictions than those with smaller batch sizes. When predicting images with densely placed buildings, cases with smaller batch sizes tended to predict larger masks, causing building masks to collide and appear as a single structure, similar to the original SAM. Training with a larger batch size partially mitigated this issue; however, models trained with AdamW did not improve on this particular problem.
In the prediction results obtained by the original model and all the trained models, patterned gaps appeared within the building masks. Examples of these inaccuracies are shown in Fig. 11. These inaccuracies cause not only lower metric scores but also overly complex building boundaries. The degree of this effect varied depending on the hyperparameter combination used. With AdamW as the optimizer, the patterns occurred less frequently. However, with Adam and Lion, a significant increase in both the frequency and area of pattern occurrences was observed in specific cases: Cases 2, 3, and 6 had a higher rate of such errors, and in Case 3 in particular, the mask generation process itself did not proceed correctly.
We tested three cases of simply upscaling the image and three cases of kinetically cropping the image with different buffer sizes. The evaluation metric scores are shown in Figs. 12 and 13, and examples of mask prediction are shown in Figs. 14 and 15.
To examine the relationship between SAM's performance and object size, we adopted an approach similar to Ren et al. (2024). We first evaluated segmentation performance on images upscaled by a factor of 2^n (n = 1, 2, 3). All trained models were evaluated using upscaled test images. Among the optimizer types, Case 6 demonstrated the highest performance with AdamW, while Case 9 performed best with Adam or Lion. The comparison was made with the models that achieved the highest performance for each optimizer. When the images were upscaled by factors of 2 and 4, both metrics improved compared to the original scale across all three trained models.
However, when upscaled by a factor of 8, performance noticeably decreased. For the original model, the F1 Score improved to 0.7641 from 0.7276 (an increase of 0.0365) when the image was upscaled by a factor of 2, while the IoU increased significantly to 0.6393 from 0.5894 (an increase of 0.0499). Similarly, for the model trained using AdamW, the F1 Score increased to 0.7909 from 0.7767, and the IoU rose to 0.6751 from 0.6553, outperforming all previously trained models. Across all optimizers, the best performance was observed when the image scale was between 2x and 4x.
As the image scale increases, the model tends to misclassify nonbuilding areas. Additionally, as shown in Fig. 12, the prediction performance along the boundaries deteriorated during the process of reassembling the images that had been divided into grids. When upscaling was applied, it was observed that the performance improvement of the trained model was more limited compared to the original model.
Next, we evaluated the original model and the model trained with AdamW under the Case 6 setting while varying the buffer size, which determines the extent of the building included in the image. Three buffer sizes of 1, 50, and 100 pixels were assessed. Both the original model and the AdamW-trained model performed progressively better as the buffer grew, with the 100-pixel buffer performing best. However, similarly to the simple upscaling results, as the buffer size increased, the performance gap between the trained model and the original model decreased. With kinetic cropping, the F1 Score of the AdamW-trained Case 6 model reached 0.8694 from 0.7515, an increase of 0.1179. For the original SAM model, the F1 Score improved from 0.7276 to 0.8499, and the IoU increased from 0.5894 to 0.7438.
Visual comparisons also demonstrated an improvement in mask generation performance. When predicting areas with densely located buildings, building boundaries were extracted more accurately than when using the full image, enabling the mask outputs to be generated without overlapping. However, the patterned inaccuracies tended to appear more often in the trained models. Figs. 14 and 15 illustrate the prediction masks generated by the various upscaling methods.
In this study, we optimized the application of SAM, specializing it for satellite images through two primary approaches, given that SAM is predominantly pretrained on natural images. First, optimizing the hyperparameters allows SAM to improve segmentation quality on satellite images, specifically those from the KOMPSAT dataset in this study. When training SAM, the occasional prediction of masks containing gaps should be considered, and it is crucial to develop strategies that minimize this phenomenon while enhancing building segmentation performance. Second, we applied SAM to images that were cropped object-by-object and buffered to further enhance performance.
Initially, we conducted a comparative analysis of the loss variation during training and the predictive performance of the trained models under different hyperparameter combinations. Performance improved most significantly with the AdamW optimizer: the F1 Score increased from 0.7276 to 0.7767, a rise of 0.0491, while the IoU improved from 0.5894 to 0.6554, a gain of 0.0660. With AdamW, there were no significant differences related to learning rate and batch size. However, even with a large batch size and high learning rate, stable and fast training was achieved when a learning rate scheduler was applied. Additionally, under the proposed hyperparameter settings, training could be conducted without increasing the occurrence of inaccuracies.
To further examine the impact of image scaling, we compared performance when images were upscaled by factors of 2, 4, and 8, as well as when kinetic cropping and buffering were applied to objects. Applying kinetic cropping with a 100-pixel buffer to the model trained with AdamW led to a significant increase in the F1 Score, from 0.7767 to 0.8694, an improvement of 0.0927. Visual inspection confirmed this enhancement, showing that buildings in densely built-up images can be extracted without overlapping masks.
Despite these advancements, this study has certain limitations. There are still instances where SAM creates inaccuracies within the mask during mask generation; this issue may be addressable through the application of HQ-SAM (Ke et al., 2023). Additionally, the loss value decreased rapidly within a single epoch; adjusting the dataset size would allow the loss to decrease continuously or remain stable over more epochs. Moreover, removing items from the dataset that significantly reduce the learning effect is likely to enhance the effectiveness of training. Further research could explore boundary-aware learning during training, considering the orientation of building segments and adjusting the image orientation accordingly.
This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00155763), and Korea Ministry of Land, Infrastructure and Transport (MOLIT) as 「Innovative Talent Education Program for Smart City」. The Institute of Engineering Research at Seoul National University provided research facilities for this work. This research (paper) used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through ‘AI-Hub (www.aihub.or.kr)’.
No potential conflict of interest relevant to this article was reported.
Korean J. Remote Sens. 2024; 40(5): 551-568
Published online October 31, 2024 https://doi.org/10.7780/kjrs.2024.40.5.1.11
Copyright © Korean Society of Remote Sensing.
Donghyeon Lee1 , Jiyong Kim2 , Yongil Kim3*
1Master Student, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
2Researcher, Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
3Professor, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
Correspondence to:Yongil Kim
E-mail: yik@snu.ac.kr
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Extracting building information from Very-High-Resolution (VHR) satellite images is critical for urban mapping and monitoring. Traditional manual annotation methods are labor-intensive and costly, making automated solutions highly desirable. Segment Anything Model (SAM), a foundation model trained mostly on natural images, has recently shown high performance on diverse segmentation tasks. However, due to differences in perspective and the average size of objects in the images, SAM exhibits lower performance when extracting buildings from satellite imagery. These limitations, derived from differences in image domains, can be addressed by fine-tuning the model with satellite images and preprocessing the input images. However, various hyperparameters, such as learning rate, batch size, and optimizer type, deeply impact the performance of the fine-tuned model, and thus, in-depth investigations on these hyperparameters are critical for model adaptation. To identify the optimal hyperparameter configuration, we conducted extensive experiments with combinations of hyperparameter settings using Korea Multi-Purpose Satellite (KOMPSAT) images. Additionally, various upscaling methods and object-by-object preprocessing techniques were compared and evaluated, leading to the proposal of an effective preprocessing approach. With the optimal combination, an F1 Score of 0.862, an Intersection over Union (IoU) of 0.761, and a mean IoU (mIoU) of 0.705 were achieved using AdamW optimizer, object-by-object cropping, and 100-pixel buffering. The proposed hyperparameter optimization method in our research underscores the effectiveness of fine-tuning SAM for accurate building extraction in VHR satellite imagery, thereby enabling more reliable data interpretation and decision-making processes in automated remote sensing applications.
Keywords: Building extraction, Segment anything model, Hyperparameter optimization, KOMPSAT
Extracting accurate spatial information on buildings is critical in urban planning, disaster monitoring, agricultural production prediction, and population estimation. Furthermore, the increase in very high-resolution satellite imagery has opened up new opportunities for extracting accurate building information. Accordingly, semantic segmentation (Xu et al., 2018), which involves annotating an image at the pixel level, is considered one of the most important subtasks in the analysis of building spatial information from optical satellite imageries.
Manual annotation of buildings requires an excessive amount of human intervention, which limits its practicality and scalability (Zheng et al., 2023). Moreover, the complex scenes that appear in satellite images, including the diversity of building types, shapes, sizes, and colors, greatly limit automated building extraction. Furthermore, buildings are usually placed closer to other buildings or roads, which makes it hard for machines to distinguish the boundaries of various objects. Considering the ability of Deep Neural Networks (DNNs) to incorporate both shallow and deep features into segmentation, the use of deep learning methods for automated building extraction has been investigated.
Convolutional Neural Networks (CNNs), which incorporate convolution layers on DNNs to deal with two-dimensional image data, have demonstrated exceptional ability in building segmentation (Krizhevsky et al., 2017). Therefore, Fully Convolutional Neural Networks (FCNs), an extension of CNN, were proposed to address arbitrary-sized images (Long et al., 2016). However, FCNs generate coarse-level building prediction. Two approaches were proposed to overcome this problem. The first approach involves using U-shaped encoder-decoder structures, such as U-Net (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2015) by constructing symmetrical decoders to encoders. The encoder captures features from the image and represents them in high-dimensional features, and the decoder restores the fine-resolution data from both high-dimensional features and low-dimensional features by using skip connections. Another approach is constructing multiscale networks, such as HRNet (Sun et al., 2019). By repeatedly fusing multiscale data, high-resolution features can be preserved.
However, CNN-based models suffer from drawbacks such as a high computation burden on pixel-level classification (Guo et al., 2018). Furthermore, CNN-based models have limitations in elaborating global information. In contrast, transformer architectures that utilize attention mechanisms (Vaswani et al., 2017) offer distinct advantages in capturing global information. Dosovitskiy et al. (2020) proposed a Vision Transformer (ViT) to adapt the self-attention mechanism to computer vision. Several researchers have applied transformer-based models for semantic segmentation using ViT. For instance, Wang et al. (2022b) employed a transformer architecture to develop a decoder, and Wang et al. (2022a) implemented a multipath network that extracts global context using ViT and spatial context through CNN.
Unlike CNN-based models, transformer-based models necessitate extensive datasets for training, which are often not readily available in the remote sensing domain. Consequently, transformer-based models are often pretrained and fined-tuned for specified applications. These pretrained models referred to as foundation models, are trained on large datasets comprising a substantial number of images to enhance their generalization capabilities. The Segment Anything Model (SAM) has increasingly found application in large-scale visual models, and it exhibits cutting-edge zero-shot capabilities in diverse settings by exploiting prompts (Kirillov et al., 2023). SAM is made up of three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the SAM structure. Each part of SAM was pretrained using the SA-1B dataset, which contains more than 1 billion masks (Kirillov et al., 2023). A substantial portion of the dataset contains satellite images to handle its unique geometric properties.
The unique perspective of satellite images introduces geometric differences that limit the straightforward application of State-of-the-Art (SOTA) models to this type of data (Verde et al., 2018; Chen et al., 2024a). Additionally, since satellite images capture large areas, buildings often appear small within the images. It is important to note that SAM is known to have a bias regarding the size of the objects it extracts (Ren et al., 2024). Therefore, it is expected that preprocessing the input images before applying SAM could improve extraction performance. Therefore, these unique characteristics related to perspective and object scale necessitate the optimization of hyperparameters and image preprocessing to achieve better performance.
Various researchers have attempted to increase the performance of the SAM for satellite images. Chen et al. (2024a) proposed an RSPrompter for prompt learning to adapt suitable prompts for remote sensing images. In terms of fully automated SAM utilization, Khatua et al. (2024) applied You Only Look Once (YOLOv8) object detection algorithm to generate prompts. However, these methods do not enhance the mask generation capacity itself, as SAM remains unmodified. Zhou et al. (2024) proposed two enhancements and introduced a fine-tuned version of SAM named MeSAM. First, they added adapter layers between transformer layers in SAM’s image encoder to preserve high spatial data in the remote sensing images. Second, they restructured and fine-tuned the mask decoder with an additional dataset to accommodate more tokens. Nonetheless, adding a new structure compromises the already limited usability of SAM due to the large image encoder structure. Regarding the general usage of SAM, Ke et al. (2023) identified the irregular pattern generated occasionally and added a high-quality output token to address the low-resolution problem. The occurrence of this phenomenon can be exacerbated by the selection of hyperparameters. Therefore, methods to optimize the training of satellite images in SAM should focus on identifying hyperparameter combinations that facilitate effective training while mitigating these issues. In this study, we first aim to determine the optimal hyperparameters for training SAM on Korea Multi-Purpose Satellite (KOMPSAT) images.
Huang et al. (2024) integrated multiscale segmentation with SAM to address the challenge of segmenting complex farmland parcels in high-resolution satellite imagery. To effectively extract parcels of varying sizes, they generated pre-segmentation results at multiple scales and used these results as prompts for SAM. Mazurowski et al. (2023) applied SAM to medical images containing objects of various sizes, confirming that SAM’s performance is proportional to the average object size in the images. However, this study focused solely on analyzing SAM’s applicability and did not propose any methods for enhancing its performance.
Based on a comprehensive analysis of previous research, we identified key challenges in training related to differences in image perspectives, stability during the training process, and the varying sizes of target objects for extraction. Our objective is to determine the optimal hyperparameter combination for training SAM’s mask decoder, specifically focused on satellite images, to address these issues. During training, the image encoder and prompt encoder remain frozen. In this study, SAM was trained on a specific satellite image dataset, namely KOMPSAT. Therefore, the objective of this study is to optimize SAM’s hyperparameters for training and to propose the cropping and buffering method for KOMPSAT images.
SAM is a foundation model developed for general semantic segmentation tasks (Kirillov et al., 2023). It has been trained on the SA-1B dataset, which was specifically created for SAM and contains over one billion features. Since SAM is designed to be a robust model applicable to various types of images, the SA-1B dataset primarily consists of natural images, with a small portion of satellite images. However, natural images have geometric differences compared to satellite images, leading to SAM’s lower performance in building extraction tasks.
SAM is built with three parts: image encoder, prompt encoder, and mask decoder. Fig. 1 shows the structure of SAM. Both the image encoder and the mask decoder are made up of vision transformer layers. The image encoder contains over 636 million parameters, and it creates image embedding for the mask decoder. The mask decoder is a lightweight structure compared to the image encoder, containing 4 million parameters. For the same image, the embeddings generated by the image encoder can be reused for predicting masks of different objects. In contrast, the prompt encoder and mask decoder are unable to handle batched input and must be executed separately for each set of prompts.
The prompt is an additional input that can specify the object to be segmented. Sparse prompts such as points, box, and text and dense prompts such as masks can be used for SAM when each prompt is available. The prompt encoder represents the point and box prompts using positional encoding (Tancik et al., 2020). Multiple types of prompts can be used simultaneously if they indicate the same object within the image. Nonetheless, bounding box prompts are known to give the most precise annotation for SAM in building extraction (Ren et al., 2024). The features from the image encoder and prompt encoder are then concatenated and handled by the mask decoder to draw the final mask result and prediction score.
Since SAM requires prompting to segment a specific object, we preprocessed the building pixel coordinates in the label data to create a bounding box prompt for each building segment. We first calculated the maximum and minimum values of the building pixel coordinates to achieve the exact bounding box. To facilitate the generalized application of SAM, we applied a buffer area to bounding boxes when creating prompt inputs for two reasons. First, in most cases on the application of SAM, the exact bounding box coordinate will not be available. Second, in the dataset, the created label data have smaller dimensions than those shown in the image. Considering the spatial resolution of the image and the aforementioned reasons, a 10-pixel buffer was added when producing a bounding box prompt. Fig. 2 shows the preprocessing stage of creating a bounding box.
Ren et al. (2024) assessed SAM on various satellite datasets including Solar, 38-Cloud, DeepGlobe Roads, and SpaceNet 2. Additionally, they evaluated SAM’s performance on the DigitalGlobe Building and Inria Building datasets to focus on results related to building datasets. To dynamically analyze the impact of object scale on performance, they upscaled the input images by factors of 2, 4, and 8, and concluded that SAM’s performance can depend significantly on object scale. We applied a similar approach, which could be referred to as the kinetic cropping strategy, to conduct an in-depth analysis of the relationship between performance and scale. The kinetic cropping strategy involves cropping images one-to-one for bounding box prompts and using them as input. When the input image and bounding box prompt are given, the crop area is determined by selecting buffer pixels. We have tested 3 cases, by varying buffer pixels to 1, 50, and 100. Bilinear interpolation was used for all resizing. In Fig. 3, the preprocessing methods of cropping and buffering applied to the images can be observed.
We applied the adaptation method, a widely used method to improve model performance in deep learning (Gururangan et al., 2020). During training, only the parameters of the mask decoder were left trainable, while the other two components of SAM’s architecture were frozen. Four hyperparameters, optimizer, learning rate, batch size, and learning rate scheduler were tested to identify the optimal hyperparameter combination for building extraction with SAM.
For optimizers, Adaptive Moment Estimation (Adam; Kingma and Ba, 2014), Adam with decoupled Weight Decay (AdamW; Loshchilov and Frank, 2017), and Evolved Sign Momentum (Lion; Chen et al., 2024b) were tested. Adam and AdamW are optimization algorithms that use adaptive learning rates, but they differ in how they apply weight decay. In Adam, weight decay is applied directly during the update along with the learning rate. In AdamW, the parameter update is performed first, followed by the application of weight decay. This distinction allows AdamW to achieve more effective regularization. Lion optimizer does not average the gradients but instead tracks the direction of the gradients to achieve faster convergence. Instead of applying weight decay directly, it utilizes a technique that helps mitigate overfitting by maintaining a more stable trajectory during updates. Lion is believed to exhibit advantageous performance, especially in large-scale models.
The learning rate controls how much the model’s parameters are updated in each iteration. A higher learning rate can help the model avoid getting trapped in a local minimum and move toward the global minimum, but it may also cause convergence difficulty. Conversely, a lower learning rate can lead to convergence at local minima, increase the risk of overfitting, and slow down the convergence process. Batch size similarly impacts the training dynamics. A smaller batch size allows the model to learn unique cases better but may also increase the risk of overfitting. Therefore, selecting the appropriate learning rate and batch size is crucial for optimal performance. This is especially important for fine-tuning foundation models, where improper tuning can degrade pre-trained features and lead to lower performance.
KOMPSAT series observation data enable independent Earth observation in Korea. Therefore, various studies have been conducted to effectively and actively utilize data from the KOMPSAT series (Acharya et al., 2016; Kim and Kim, 2023). The KOMPSAT-3/3A sister satellites are used for VHR optical observation and are appropriate for building extraction. Both satellites use the same optical sensor, the only difference being the added infrared capability in KOMPSAT-3A. Moreover, the orbit’s altitude is different, creating differences in the Ground Sample Distance (GSD). KOMPSAT-3 orbits 625 km above sea level and captures images of 0.70 m GSD, whereas KOMPSAT-3A operates at an altitude of 528 km and GSD of 0.55 m (Committee on Earth Observation Satellites, 2024). Table 1 summarizes the specifications of KOMPSAT-3/3A satellites.
Table 1. Specifications of the KOMPSAT-3/3A satellites (Committee on Earth Observation Satellites, 2024).
Parameter | KOMPSAT-3 | KOMPSAT-3A |
---|---|---|
Launch date | May 18, 2012 | March 25, 2015 |
Ground sample distance | 0.70 m | 0.55 m |
Altitude | 625 km | 528 km |
Inclination | 98.13° | 97.513° |
Orbital period | 98.5 minutes | 95.2 minutes |
Number of revolutions | 14.6 revolutions/day | 15.1 revolutions/day |
Repeated ground track | 28 days / 423 revolutions | 28 days / 409 revolutions |
In this study, the AI-Hub Satellite Image Building Boundary Detection Dataset version 1.0, built from KOMPSAT-3/3A images, was used. The dataset comprises imagery of four cities (Los Angeles in the USA, Shanghai in China, Wolfsburg in Germany, and New Cairo in Egypt) and captures a variety of building forms by country and continent in both raster and polygon formats. It provides building coordinates in both the longitude–latitude coordinate system and image pixel coordinates. The building type is encoded by color in the label image, but this information was not used in this study. The dataset features over 150,000 buildings in 1,238 tiles of officially divided training data and 50,000 buildings in 159 tiles of validation data. Each tile is 1,024 × 1,024 pixels at 0.55–0.70 m resolution. Whole images were used without cropping, since SAM’s standard input size is 1,024 × 1,024. We used the training data for training and validation and the validation data for testing. Fig. 4 shows an example image and label image from the dataset.
To identify the optimal hyperparameter combination for SAM, we conducted 27 experimental cases for each scenario, with and without a learning rate scheduler. These experiments varied the optimizer type, learning rate, and batch size, as these hyperparameters directly influence the training dynamics. The remaining training conditions, such as the loss function and number of epochs, were kept constant, since they either define the static objective of training or do not significantly affect the training process. Each optimizer was tested with three learning rates (5 × 10⁻⁸, 1 × 10⁻⁷, and 2 × 10⁻⁷) and three batch sizes (4, 8, and 12), giving nine combinations per optimizer. The maximum batch size was constrained by the GPU used, an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory; 12 was the largest batch size at which the GPU could hold the whole model and the embeddings. Batch sizes of 1 or 2 were avoided because of the risk of overfitting. The learning rate candidates were chosen empirically from several preliminary trials. The test cases are listed in Table 2. When the learning rate scheduler was used, the StepLR scheduler reduced the learning rate by half every 2 epochs. For the Adam optimizer, an L2-regularization weight decay of 0.01 was chosen to prevent overfitting by adding a penalty term to the loss function; the same weight decay of 0.01 was applied to AdamW, while Lion was used with its default settings.
Table 2. Test case numbers for combinations of learning rate and batch size (optimizers: Adam, AdamW, Lion; scheduler: StepLR).

Batch size | Learning rate = 5×10⁻⁸ | Learning rate = 1×10⁻⁷ | Learning rate = 2×10⁻⁷
---|---|---|---
4 | Case 1 | Case 2 | Case 3
8 | Case 4 | Case 5 | Case 6
12 | Case 7 | Case 8 | Case 9

The case numbering applies identically to every optimizer tested.
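The case grid of Table 2 and the scheduler setting described above could be configured as in the following sketch; the dictionary layout is illustrative, and `trainable_params` is reused from the earlier snippet.

```python
import itertools
import torch

learning_rates = [5e-8, 1e-7, 2e-7]
batch_sizes = [4, 8, 12]

# Cases 1-9 enumerate (batch size, learning rate) in the row-major
# order of Table 2.
cases = {i + 1: dict(batch_size=bs, lr=lr)
         for i, (bs, lr) in enumerate(itertools.product(batch_sizes,
                                                        learning_rates))}

optimizer = torch.optim.AdamW(trainable_params,
                              lr=cases[6]["lr"], weight_decay=0.01)
# When used, StepLR halves the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
```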
Other hyperparameters, such as the loss function and number of epochs, were fixed across experiments to focus the analysis on the dynamic hyperparameters that influence the training process. Binary cross-entropy (BCE) with logits loss was used to compute the loss and update the parameters during training. It combines a sigmoid layer and the BCE loss in a single class and can compute the loss with better numerical stability than the plain BCE loss by using the log-sum-exp trick. BCE loss is the most commonly used loss function for semantic segmentation. The predicted probability is mapped to the range 0 to 1 by the sigmoid layer, and a threshold of 0.5 was used to determine whether each pixel belongs to a building. The unreduced loss can be defined as:

$$\ell_i = -\left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right], \qquad L = \{\ell_1, \ldots, \ell_N\}$$

where $y_i$ and $p_i$, respectively, denote the label and predicted probability of pixel $i$, and $N$ is the number of pixels in the minibatch.
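In PyTorch terms, this loss corresponds to `nn.BCEWithLogitsLoss`; the tensor shapes and names below are placeholders for illustration.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # sigmoid + BCE in one numerically stable class

# Placeholders: raw mask logits from SAM and a binary single-object
# label image of the same shape.
logits = torch.randn(1, 1, 1024, 1024)
labels = torch.randint(0, 2, (1, 1, 1024, 1024)).float()

loss = criterion(logits, labels)

# At inference, probabilities are thresholded at 0.5 for the final mask.
pred_mask = torch.sigmoid(logits) > 0.5
```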
Using the Case 9 setting with each optimizer, we trained SAM for 100 epochs to observe the trend in the loss value. As shown in Fig. 5, the training loss reaches its minimum before epoch 10 and does not decrease further; in some instances it then remains static or increases slightly. Based on this observation, 10 epochs are sufficient, and increasing the epoch count does not improve model performance. Therefore, 10 epochs were used consistently throughout all experiments.
As mentioned earlier, SAM segments a single specified object, so each SAM output contains one object per image. To match this format, we created a single-object label image for each building from the ground truth data and used it to measure the loss. Fig. 6 shows an example of the label image preprocessed separately for each building segment. Furthermore, an erosion kernel was applied to avoid collisions between building masks, as the white boundary in the original label image extends beyond the actual building boundary (see Fig. 6), causing overlaps between adjacent building masks.
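A minimal sketch of the erosion step with OpenCV follows; the 3 × 3 kernel size is an assumption, since the exact kernel used is not stated.

```python
import cv2
import numpy as np

# single_building_mask: 8-bit binary mask (255 = building) for one object.
single_building_mask = np.zeros((1024, 1024), np.uint8)  # placeholder

# Erode each per-building mask slightly so that the enlarged label
# boundaries no longer overlap with neighboring building masks.
kernel = np.ones((3, 3), np.uint8)
eroded_mask = cv2.erode(single_building_mask, kernel, iterations=1)
```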
F1 Score, Intersection over Union (IoU), and mean IoU (mIoU) were used to measure the performance of the trained models, along with Precision, Recall, and Accuracy. These metrics are widely used in semantic segmentation tasks and are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) denote the numbers of pixels classified as true positive, true negative, false positive, and false negative, respectively; building and background pixels are treated as positive and negative, respectively. IoU was measured on the whole-patch prediction, with every building in the image merged, while mIoU was measured per building segment and then averaged. The best hyperparameter set for SAM was determined by comparing these metrics.
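The metrics can be computed from pixel counts as in this sketch; mIoU is obtained by applying the IoU formula per building segment and averaging the results.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Compute pixel-level metrics for binary masks (1 = building).
    Zero-division guards are omitted for brevity."""
    tp = float(np.sum((pred == 1) & (label == 1)))
    tn = float(np.sum((pred == 0) & (label == 0)))
    fp = float(np.sum((pred == 1) & (label == 0)))
    fn = float(np.sum((pred == 0) & (label == 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),
    }
```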
To examine training stability and convergence speed, we compared the loss values during training. Fig. 7 shows the mean epoch loss of every hyperparameter case for each optimizer. Epochs are numbered from 0 to 9; for each case, the mean epoch loss is drawn as a solid line when the learning rate scheduler is not used and as a dotted line when it is, with the same case shown in the same color.
Using Adam or Lion without the learning rate scheduler caused significantly unstable training, with loss values climbing even above the initial state. Cases with AdamW did not show this problem, and the training loss converged to a lower value in every case. The final convergence values also differed: with AdamW, all cases converged to between 0.0004 and 0.0007, lower than any case with the other optimizers, whereas the minimum loss for both Adam and Lion was 0.0011, reached in Case 7 without the scheduler.
The learning rate scheduler had a crucial effect on training stability, especially for Adam and Lion. Without a scheduler, the divergence slope was steeper; Cases 2, 3, 6, and 9 exhibited the strongest tendency, and this trend was more pronounced in cases with a larger learning rate and smaller batch size, the property common to those cases. Still, all cases initially succeeded in decreasing the loss to a minimum of around 0.0012. When the scheduler was used, however, some cases failed to converge to that minimum. Model performance can therefore vary regardless of whether the training loss converges.
With the AdamW optimizer, the training process was largely unaffected by changes in learning rate and batch size; the only difference observed was convergence speed, since a larger learning rate and smaller batch size (and hence more iterations per epoch) led to faster convergence. When a scheduler was used, this difference diminished further. Adam and Lion showed a similar trend with respect to learning rate and batch size: a larger batch size and smaller learning rate made training more stable, while the opposite combination introduced more variability and instability. Batch size had the greater impact, likely because it determines the number of iterations per epoch.
To analyze the performance of the trained models, we evaluated both the final model and the model with the lowest mean epoch loss on the test dataset. Both quantitative and qualitative results were assessed to determine the best hyperparameter combination.
Quantitative comparisons were conducted, as shown in Fig. 8. For each optimizer, the metric scores of the best-performing cases are presented in Table 3, with the highest score in bold. Fig. 8 shows that all trained cases achieve better F1 Scores than the original SAM, marked with a red dotted line.
Table 3. Quantitative metric scores, with the largest value in each column in bold.

Optimizer | Scheduler | Case no. | F1 Score | IoU | mIoU | Precision | Recall | Accuracy
---|---|---|---|---|---|---|---|---
Original SAM | – | – | 0.72763 | 0.58943 | 0.65720 | 0.60019 | 0.97398 | 0.89239
Adam | O | 9 | 0.75148 | 0.62326 | 0.74637 | 0.64869 | 0.93911 | 0.90998
Adam | X | 8 | 0.75644 | 0.62366 | 0.74782 | 0.64801 | 0.94688 | 0.91008
AdamW | O | 6 | 0.77668 | **0.65535** | **0.86478** | **0.66284** | 0.97814 | **0.91588**
AdamW | X | 6 | **0.77835** | 0.65148 | 0.84112 | 0.65961 | **0.98136** | 0.91581
Lion | O | 9 | 0.75164 | 0.62356 | 0.74719 | 0.64842 | 0.94005 | 0.91008
Lion | X | 8 | 0.75630 | 0.62345 | 0.74776 | 0.64789 | 0.94674 | 0.90998
All of the best metrics were achieved by cases trained with AdamW. With AdamW, the trained model’s F1 Score was roughly 0.05 higher than the original SAM’s, regardless of whether the scheduler was used; with the scheduler, the metric values clustered very closely, indicating more consistent training. For Adam and Lion, the F1 Score gain depended on the scheduler: approximately 0.024 with it and about 0.029 without it.
The influence of batch size was consistent across all cases, with larger batch sizes yielding better performance. The effect of the learning rate, however, differed between AdamW and the other two optimizers: for AdamW, larger learning rates led to better performance, whereas for Adam and Lion, smaller learning rates produced better results. The best F1 Score for Adam was 0.7564 in Case 8 and the worst 0.74016 in Case 3, both without the scheduler; similarly, Lion’s best F1 Score was 0.7563 in Case 8 and its worst 0.7438 in Case 3, also without the scheduler.
The original SAM model generated masks larger than the actual buildings, producing overlapping areas between masks; this creates an ambiguous boundary region between segments in which the exact building outline cannot be determined. The trained models exhibited this behavior to a lesser extent, allowing them to better separate nearby building segments. Fig. 9 shows the mask generation results of all 27 models trained with the learning rate scheduler, and Fig. 10 shows selected mask predictions from the 159 validation images. The models generated increasingly distinct masks between objects in the order Adam, Lion, AdamW, consistent with AdamW’s higher recall.
As shown in Table 3, Adam and Lion both achieved their highest performance in Case 9 with the scheduler and in Case 8 without it, whereas AdamW performed best in Case 6. For AdamW, however, the variation between cases was minimal, suggesting that selecting a single best case from the evaluation metrics is of limited value.
For Adam and Lion, the impact of learning rate and batch size was more pronounced. As the learning rate increased, prediction stability decreased, particularly affecting boundary expression and degrading performance: boundaries became jagged, and grid-like dots appeared inside the masks. This phenomenon can be attributed to the original SAM’s tendency to predict larger objects, a key difference between natural and satellite images. During training, SAM predicts uniquely shaped building segments over much larger areas, producing high losses and large parameter updates when it mispredicts smaller buildings.
As a result, the model degraded significantly and failed to predict smaller buildings, creating grid-patterned gaps in the masks; high learning rates additionally caused unstable boundary shapes and overly complex masks. A lower learning rate therefore appears preferable for fine-tuning SAM, as parameter updates remain small even when a mask prediction fails badly, preventing the model from deteriorating, although training then takes longer to converge. With higher learning rates, the trained models predicted uniquely shaped buildings better but failed on larger, more regularly shaped buildings. Overall, the lower learning rate of 5 × 10⁻⁸ yielded better results.
Cases with larger batch sizes produced more accurate masks than those with smaller batch sizes. When predicting images with densely placed buildings, models trained with smaller batch sizes tended to predict oversized masks, causing building masks to collide and appear as a single structure, much like the original SAM. Training with a larger batch size partially mitigated this issue, although models trained with AdamW showed no improvement on this particular problem.
In the predictions of the original model and all trained models, patterned gaps appeared within the building masks; examples of these inaccuracies are shown in Fig. 11. They lower the metrics and also produce overly complex building boundaries. The severity varied with the hyperparameter combination: with AdamW the patterns occurred less frequently, whereas with Adam and Lion both the frequency and the affected area increased significantly in specific cases. Cases 2, 3, and 6 had the highest rates of such errors, and in Case 3 the mask generation process itself did not proceed correctly.
We tested three cases of simple image upscaling and three cases of kinetic cropping with different buffer sizes. The evaluation metric scores are shown in Figs. 12 and 13, and examples of mask predictions are shown in Figs. 14 and 15.
To examine the relationship between SAM’s performance and object size, we adopted an approach similar to Ren et al. (2024). We first evaluated segmentation performance on images upscaled by factors of 2^n (n = 1, 2, 3). All trained models were evaluated on the upscaled test images. Among the optimizers, Case 6 performed best with AdamW, while Case 9 performed best with Adam or Lion, so the comparison used the best-performing model for each optimizer. When the images were upscaled by factors of 2 and 4, both metrics improved over the original scale for all three trained models.
When upscaled by a factor of 8, however, performance noticeably decreased. For the original model, 2x upscaling improved the F1 Score from 0.7276 to 0.7641 (an increase of 0.0365) and the IoU from 0.5894 to 0.6393 (an increase of 0.0499). Similarly, for the AdamW-trained model, the F1 Score rose from 0.7767 to 0.7909 and the IoU from 0.6553 to 0.6751, outperforming all previously trained models. Across all optimizers, the best performance was observed at scales between 2x and 4x.
As the image scale increased, the model tended to misclassify nonbuilding areas. Additionally, as shown in Fig. 12, boundary prediction deteriorated when reassembling images that had been divided into grids. With upscaling, the performance improvement of the trained model was also more limited than that of the original model.
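A sketch of this upscale-tile-reassemble evaluation follows, assuming OpenCV bilinear resizing; the function name and grid logic are illustrative, and edge tiles may be smaller than 1,024 pixels.

```python
import cv2
import numpy as np

def upscale_and_tile(image: np.ndarray, factor: int, tile: int = 1024):
    """Upscale an image by `factor` with bilinear interpolation, then
    yield SAM-sized tiles with their grid offsets so the predicted
    masks can be stitched back together afterwards."""
    upscaled = cv2.resize(image, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_LINEAR)
    height, width = upscaled.shape[:2]
    for row in range(0, height, tile):
        for col in range(0, width, tile):
            yield (row, col), upscaled[row:row + tile, col:col + tile]
```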
Next, we evaluated the original model and the AdamW-trained Case 6 model while varying the buffer size, which determines how much context around each building is included in the crop. Three buffer sizes, 1, 50, and 100 pixels, were assessed. Both models improved as the buffer grew from 1 to 50 to 100 pixels, although, as with simple upscaling, the performance gap between the trained and original models narrowed as the buffer size increased. With kinetic cropping, the F1 Score of the AdamW-trained Case 6 model reached 0.8694, an increase of 0.1179 from 0.7515. For the original SAM model, the F1 Score improved from 0.7276 to 0.8499 and the IoU from 0.5894 to 0.7438.
Visual comparisons also demonstrated improved mask generation. When predicting areas with densely located buildings, building boundaries were extracted more accurately than when the full image was used, enabling the mask outputs to be generated without overlapping. However, the patterned inaccuracies tended to appear more often in the trained models. Figs. 14 and 15 illustrate the prediction masks generated under the various upscaling methods.
In this study, we optimized the application of SAM, which is predominantly pretrained on natural images, to satellite imagery through two primary approaches. First, optimizing the hyperparameters allowed SAM to improve segmentation quality on satellite images, specifically the KOMPSAT dataset used here; when training SAM, the occasional prediction of masks containing gaps should be considered, and it is crucial to develop strategies that minimize this phenomenon while enhancing building segmentation performance. Second, we applied SAM to images cropped object-by-object and buffered to further enhance performance.
We first conducted a comparative analysis of the loss variation during training and the predictive performance of the trained models across hyperparameter combinations. Performance improved most significantly with the AdamW optimizer: the F1 Score increased from 0.7276 to 0.7767, a rise of 0.0491, while the IoU improved from 0.5894 to 0.6554, a gain of 0.066. With AdamW, there were no significant differences related to learning rate or batch size; however, even with a large batch size and high learning rate, stable and fast training was achieved when a learning rate scheduler was applied. It was also confirmed that, with the proposed hyperparameter settings, training proceeded without increasing the occurrence of mask inaccuracies.
To further examine the impact of image scaling, we compared performance when images were upscaled by factors of 2x, 4x, and 8x, and when kinetic cropping with buffering was applied to each object. Applying kinetic cropping with a 100-pixel buffer to the AdamW-trained model raised the F1 Score substantially, from 0.7767 to 0.8694. The performance enhancement was confirmed visually: buildings in densely built-up images could be extracted without overlapping masks.
Despite these advancements, this study has certain limitations. SAM still occasionally creates inaccuracies within the generated masks; this issue could be addressed by applying HQ-SAM (Ke et al., 2023). Additionally, the loss value decreased rapidly within a single epoch; adjusting the dataset size may allow the loss to decrease continuously or remain stable over more epochs, and removing items from the dataset that significantly reduce the learning effect is likely to improve training effectiveness. Further research could explore boundary-aware learning during training, considering the orientation of building segments and adjusting the image orientation accordingly.
This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00155763), and Korea Ministry of Land, Infrastructure and Transport (MOLIT) as 「Innovative Talent Education Program for Smart City」. The Institute of Engineering Research at Seoul National University provided research facilities for this work. This research (paper) used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through ‘AI-Hub (www.aihub.or.kr)’.
No potential conflict of interest relevant to this article was reported.