Research Article

Korean J. Remote Sens. 2024; 40(5): 551-568

Published online: October 31, 2024

https://doi.org/10.7780/kjrs.2024.40.5.1.11

© Korean Society of Remote Sensing

Optimal Hyperparameter Analysis of Segment Anything Model for Building Extraction Using KOMPSAT-3/3A Images

Donghyeon Lee1 , Jiyong Kim2 , Yongil Kim3*

1Master Student, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
2Researcher, Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
3Professor, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea

Correspondence to : Yongil Kim
E-mail: yik@snu.ac.kr

Received: October 7, 2024; Revised: October 24, 2024; Accepted: October 24, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Extracting building information from Very-High-Resolution (VHR) satellite images is critical for urban mapping and monitoring. Traditional manual annotation methods are labor-intensive and costly, making automated solutions highly desirable. Segment Anything Model (SAM), a foundation model trained mostly on natural images, has recently shown high performance on diverse segmentation tasks. However, due to differences in perspective and the average size of objects in the images, SAM exhibits lower performance when extracting buildings from satellite imagery. These limitations, derived from differences in image domains, can be addressed by fine-tuning the model with satellite images and preprocessing the input images. However, various hyperparameters, such as learning rate, batch size, and optimizer type, deeply impact the performance of the fine-tuned model, and thus, in-depth investigations on these hyperparameters are critical for model adaptation. To identify the optimal hyperparameter configuration, we conducted extensive experiments with combinations of hyperparameter settings using Korea Multi-Purpose Satellite (KOMPSAT) images. Additionally, various upscaling methods and object-by-object preprocessing techniques were compared and evaluated, leading to the proposal of an effective preprocessing approach. With the optimal combination, an F1 Score of 0.862, an Intersection over Union (IoU) of 0.761, and a mean IoU (mIoU) of 0.705 were achieved using AdamW optimizer, object-by-object cropping, and 100-pixel buffering. The proposed hyperparameter optimization method in our research underscores the effectiveness of fine-tuning SAM for accurate building extraction in VHR satellite imagery, thereby enabling more reliable data interpretation and decision-making processes in automated remote sensing applications.

Keywords Building extraction, Segment anything model, Hyperparameter optimization, KOMPSAT

1. Introduction

Extracting accurate spatial information on buildings is critical in urban planning, disaster monitoring, agricultural production prediction, and population estimation. Furthermore, the increasing availability of very-high-resolution satellite imagery has opened up new opportunities for extracting accurate building information. Accordingly, semantic segmentation (Xu et al., 2018), which involves annotating an image at the pixel level, is considered one of the most important subtasks in the analysis of building spatial information from optical satellite imagery.

Manual annotation of buildings requires an excessive amount of human intervention, which limits its practicality and scalability (Zheng et al., 2023). Moreover, the complex scenes that appear in satellite images, including the diversity of building types, shapes, sizes, and colors, greatly limit automated building extraction. Furthermore, buildings are usually placed closer to other buildings or roads, which makes it hard for machines to distinguish the boundaries of various objects. Considering the ability of Deep Neural Networks (DNNs) to incorporate both shallow and deep features into segmentation, the use of deep learning methods for automated building extraction has been investigated.

Convolutional Neural Networks (CNNs), which incorporate convolution layers on DNNs to deal with two-dimensional image data, have demonstrated exceptional ability in building segmentation (Krizhevsky et al., 2017). Therefore, Fully Convolutional Neural Networks (FCNs), an extension of CNN, were proposed to address arbitrary-sized images (Long et al., 2016). However, FCNs generate coarse-level building prediction. Two approaches were proposed to overcome this problem. The first approach involves using U-shaped encoder-decoder structures, such as U-Net (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2015) by constructing symmetrical decoders to encoders. The encoder captures features from the image and represents them in high-dimensional features, and the decoder restores the fine-resolution data from both high-dimensional features and low-dimensional features by using skip connections. Another approach is constructing multiscale networks, such as HRNet (Sun et al., 2019). By repeatedly fusing multiscale data, high-resolution features can be preserved.

However, CNN-based models suffer from drawbacks such as a high computation burden on pixel-level classification (Guo et al., 2018). Furthermore, CNN-based models have limitations in elaborating global information. In contrast, transformer architectures that utilize attention mechanisms (Vaswani et al., 2017) offer distinct advantages in capturing global information. Dosovitskiy et al. (2020) proposed a Vision Transformer (ViT) to adapt the self-attention mechanism to computer vision. Several researchers have applied transformer-based models for semantic segmentation using ViT. For instance, Wang et al. (2022b) employed a transformer architecture to develop a decoder, and Wang et al. (2022a) implemented a multipath network that extracts global context using ViT and spatial context through CNN.

Unlike CNN-based models, transformer-based models necessitate extensive datasets for training, which are often not readily available in the remote sensing domain. Consequently, transformer-based models are often pretrained and fine-tuned for specific applications. These pretrained models, referred to as foundation models, are trained on large datasets comprising a substantial number of images to enhance their generalization capabilities. The Segment Anything Model (SAM) is one such large-scale visual model, and it exhibits cutting-edge zero-shot capabilities in diverse settings by exploiting prompts (Kirillov et al., 2023). SAM is made up of three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the SAM structure. Each part of SAM was pretrained using the SA-1B dataset, which contains more than 1 billion masks (Kirillov et al., 2023). However, only a small portion of the dataset consists of satellite images, which limits SAM's ability to handle their unique geometric properties.

Fig. 1. The structure of SAM proposed by Kirillov et al. (2023). The image encoder represents the image in high-level feature form, and the prompt encoder translates various types of prompts into a form that the SAM can understand. The mask decoder then uses features from encoders to predict the mask.

The unique perspective of satellite images introduces geometric differences that limit the straightforward application of State-of-the-Art (SOTA) models to this type of data (Verde et al., 2018; Chen et al., 2024a). Additionally, since satellite images capture large areas, buildings often appear small within the images. It is important to note that SAM is known to have a bias regarding the size of the objects it extracts (Ren et al., 2024). It is therefore expected that preprocessing the input images before applying SAM could improve extraction performance. In short, these unique characteristics related to perspective and object scale necessitate the optimization of hyperparameters and image preprocessing to achieve better performance.

Various researchers have attempted to increase the performance of SAM for satellite images. Chen et al. (2024a) proposed RSPrompter, a prompt-learning approach that adapts suitable prompts for remote sensing images. In terms of fully automated SAM utilization, Khatua et al. (2024) applied the You Only Look Once (YOLOv8) object detection algorithm to generate prompts. However, these methods do not enhance the mask generation capacity itself, as SAM remains unmodified. Zhou et al. (2024) proposed two enhancements and introduced a fine-tuned version of SAM named MeSAM. First, they added adapter layers between transformer layers in SAM's image encoder to preserve the high spatial detail of remote sensing images. Second, they restructured and fine-tuned the mask decoder with an additional dataset to accommodate more tokens. Nonetheless, adding new structures further compromises the already limited usability of SAM caused by its large image encoder. Regarding the general usage of SAM, Ke et al. (2023) identified the irregular patterns SAM occasionally generates and added a high-quality output token to address the low-resolution problem. The occurrence of this phenomenon can be exacerbated by the selection of hyperparameters. Therefore, methods to optimize the training of satellite images in SAM should focus on identifying hyperparameter combinations that facilitate effective training while mitigating these issues. In this study, we first aim to determine the optimal hyperparameters for training SAM on Korea Multi-Purpose Satellite (KOMPSAT) images.

Huang et al. (2024) integrated multiscale segmentation with SAM to address the challenge of segmenting complex farmland parcels in high-resolution satellite imagery. To effectively extract parcels of varying sizes, they generated pre-segmentation results at multiple scales and used these results as prompts for SAM. Mazurowski et al. (2023) applied SAM to medical images containing objects of various sizes, confirming that SAM’s performance is proportional to the average object size in the images. However, this study focused solely on analyzing SAM’s applicability and did not propose any methods for enhancing its performance.

Based on a comprehensive analysis of previous research, we identified key challenges in training related to differences in image perspectives, stability during the training process, and the varying sizes of target objects for extraction. Our objective is to determine the optimal hyperparameter combination for training SAM's mask decoder, specifically focusing on satellite images, to address these issues. During training, the image encoder and prompt encoder remain frozen. In this study, SAM was trained on a specific satellite image dataset, namely KOMPSAT. Therefore, the objective of this study is to optimize SAM's hyperparameters for training and to propose a cropping and buffering method for KOMPSAT images.

2.1. SAM

SAM is a foundation model developed for general semantic segmentation tasks (Kirillov et al., 2023). It has been trained on the SA-1B dataset, which was specifically created for SAM and contains over one billion masks. Since SAM is designed to be a robust model applicable to various types of images, the SA-1B dataset primarily consists of natural images, with only a small portion of satellite images. However, natural images have geometric differences compared to satellite images, leading to SAM's lower performance in building extraction tasks.

SAM is built from three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the structure of SAM. Both the image encoder and the mask decoder are made up of transformer layers. The image encoder contains over 636 million parameters and creates the image embeddings used by the mask decoder. The mask decoder is a lightweight structure compared to the image encoder, containing 4 million parameters. For the same image, the embeddings generated by the image encoder can be reused for predicting masks of different objects. In contrast, the prompt encoder and mask decoder are unable to handle batched input and must be executed separately for each set of prompts.

The prompt is an additional input that specifies the object to be segmented. Sparse prompts, such as points, boxes, and text, and dense prompts, such as masks, can be used with SAM when available. The prompt encoder represents point and box prompts using positional encoding (Tancik et al., 2020). Multiple types of prompts can be used simultaneously if they indicate the same object within the image. Nonetheless, bounding box prompts are known to give the most precise annotation for SAM in building extraction (Ren et al., 2024). The features from the image encoder and prompt encoder are then concatenated and processed by the mask decoder to produce the final mask result and a prediction score.
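To make the interaction between these components concrete, the following minimal sketch illustrates box-prompted prediction, assuming the publicly released segment_anything package and a downloaded ViT-H checkpoint; the checkpoint path and placeholder image are hypothetical. The expensive image embedding is computed once per tile and reused for every box prompt.

```python
# Sketch of box-prompted mask prediction with SAM (segment_anything package assumed).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical path
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # placeholder for a KOMPSAT tile
predictor.set_image(image)                          # image embedding computed once

# The embedding is reused for every building; only the box prompt changes.
box = np.array([100, 150, 220, 260])                # x_min, y_min, x_max, y_max
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
```

Because the heavy image encoder runs only once per tile in this pattern, evaluating many building prompts per image remains tractable.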

2.2. Input Prompt Preprocessing

Since SAM requires a prompt to segment a specific object, we preprocessed the building pixel coordinates in the label data to create a bounding box prompt for each building segment. We first calculated the maximum and minimum values of the building pixel coordinates to obtain the exact bounding box. To facilitate the generalized application of SAM, we applied a buffer area to the bounding boxes when creating prompt inputs, for two reasons. First, in most practical applications of SAM, exact bounding box coordinates will not be available. Second, in this dataset, the created label data have smaller dimensions than the buildings shown in the image. Considering the spatial resolution of the image and the aforementioned reasons, a 10-pixel buffer was added when producing a bounding box prompt. Fig. 2 shows the preprocessing stage of creating a bounding box.

Fig. 2. Process of producing bounding box prompt. (Step 1) The left image shows the label coordinates with the building segment. (Step 2) The middle image shows the exact bounding box created by calculating maximum and minimum coordinates. (Step 3) The right shows the final bounding box with a buffer area.
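A minimal sketch of this step, assuming each building label is available as a binary mask array, could look as follows; the function name and the 10-pixel default are illustrative.

```python
# Sketch of Section 2.2: derive a buffered bounding box prompt from the
# pixel coordinates of one building segment.
import numpy as np

def building_box(segment_mask: np.ndarray, buffer_px: int = 10) -> np.ndarray:
    """Return [x_min, y_min, x_max, y_max] with a buffer, clipped to the image."""
    rows, cols = np.nonzero(segment_mask)            # pixel coordinates of the building
    h, w = segment_mask.shape
    x_min = max(cols.min() - buffer_px, 0)
    y_min = max(rows.min() - buffer_px, 0)
    x_max = min(cols.max() + buffer_px, w - 1)
    y_max = min(rows.max() + buffer_px, h - 1)
    return np.array([x_min, y_min, x_max, y_max])
```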

2.3. Input Image Preprocessing

Ren et al. (2024) assessed SAM on various satellite datasets, including Solar, 38-Cloud, DeepGlobe Roads, and SpaceNet 2. Additionally, they evaluated SAM's performance on the DigitalGlobe Building and Inria Building datasets to focus on results related to building datasets. To analyze the impact of object scale on performance, they upscaled the input images by factors of 2, 4, and 8, and concluded that SAM's performance can depend significantly on object scale. We applied a similar approach, which we refer to as the kinetic cropping strategy, to conduct an in-depth analysis of the relationship between performance and scale. The kinetic cropping strategy crops the image around each bounding box prompt, one object at a time, and uses the crops as input. Given the input image and a bounding box prompt, the crop area is determined by the selected number of buffer pixels. We tested three cases, with buffers of 1, 50, and 100 pixels. Bilinear interpolation was used for all resizing. Fig. 3 illustrates the cropping and buffering preprocessing applied to the images.

Fig. 3. Process of preprocessing inputs. (Step 1) Calculate the image crop area using the selected buffer pixels. (Step 2) Crop the image to the area calculated in Step 1 and transform the bounding box prompt to match the new coordinate system. The red box and purple box indicate the original bounding box and the crop area, respectively. The output shown on the far right was used as the model's input.
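The following sketch reflects our reading of this procedure: crop the image around a single bounding box with the chosen buffer, shift the box prompt into the crop's coordinate system, and resize the crop to SAM's 1,024 × 1,024 input with bilinear interpolation. Function and variable names are illustrative, and OpenCV is assumed only for the resize.

```python
# Sketch of the kinetic cropping strategy described in Section 2.3.
import cv2
import numpy as np

def kinetic_crop(image, box, buffer_px, out_size=1024):
    x_min, y_min, x_max, y_max = box
    h, w = image.shape[:2]
    cx_min, cy_min = max(x_min - buffer_px, 0), max(y_min - buffer_px, 0)
    cx_max, cy_max = min(x_max + buffer_px, w), min(y_max + buffer_px, h)

    crop = image[cy_min:cy_max, cx_min:cx_max]
    scale_x = out_size / crop.shape[1]
    scale_y = out_size / crop.shape[0]
    crop = cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)

    # Transform the original box prompt into the resized crop's coordinates.
    new_box = np.array([
        (x_min - cx_min) * scale_x, (y_min - cy_min) * scale_y,
        (x_max - cx_min) * scale_x, (y_max - cy_min) * scale_y,
    ])
    return crop, new_box
```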

2.4. Fine-Tuning

We applied the adaptation approach, a widely used method to improve model performance in deep learning (Gururangan et al., 2020). During training, only the parameters of the mask decoder were left trainable, while the other two components of SAM's architecture were frozen. Four hyperparameters (optimizer, learning rate, batch size, and learning rate scheduler) were tested to identify the optimal hyperparameter combination for building extraction with SAM.

For optimizers, Adaptive Moment Estimation (Adam; Kingma and Ba, 2014), Adam with decoupled Weight Decay (AdamW; Loshchilov and Hutter, 2017), and Evolved Sign Momentum (Lion; Chen et al., 2024b) were tested. Adam and AdamW are optimization algorithms that use adaptive learning rates, but they differ in how they apply weight decay. In Adam, weight decay is applied directly during the update along with the learning rate. In AdamW, the parameter update is performed first, followed by the application of weight decay. This distinction allows AdamW to achieve more effective regularization. The Lion optimizer does not average the gradients but instead tracks only the direction (sign) of the update to achieve faster convergence. Instead of applying weight decay directly to the gradient, it uses a scheme that helps mitigate overfitting by maintaining a more stable trajectory during updates. Lion is believed to exhibit advantageous performance, especially in large-scale models.

The learning rate controls how much the model’s parameters are updated in each iteration. A higher learning rate can help the model avoid getting trapped in a local minimum and move toward the global minimum, but it may also cause convergence difficulty. Conversely, a lower learning rate can lead to convergence at local minima, increase the risk of overfitting, and slow down the convergence process. Batch size similarly impacts the training dynamics. A smaller batch size allows the model to learn unique cases better but may also increase the risk of overfitting. Therefore, selecting the appropriate learning rate and batch size is crucial for optimal performance. This is especially important for fine-tuning foundation models, where improper tuning can degrade pre-trained features and lead to lower performance.
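As a sketch of this setup, the encoders can be frozen and an optimizer built over the mask decoder parameters only. The segment_anything package is assumed; the checkpoint path is a placeholder, and the learning rate and weight decay shown correspond to one of the tested settings, not a recommended default.

```python
# Sketch of the fine-tuning setup in Section 2.4: freeze the image and prompt
# encoders and train only the mask decoder.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical path

for name, param in sam.named_parameters():
    # Only mask decoder parameters remain trainable.
    param.requires_grad = name.startswith("mask_decoder")

trainable = [p for p in sam.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-7, weight_decay=0.01)
```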

3.1. Dataset

KOMPSAT series observation data enable independent Earth observation in Korea. Therefore, various studies have been conducted to effectively and actively utilize data from the KOMPSAT series (Acharya et al., 2016; Kim and Kim, 2023). The KOMPSAT-3/3A sister satellites are used for VHR optical observation and are appropriate for building extraction. Both satellites use the same optical sensor, the main difference being the added infrared capability of KOMPSAT-3A. In addition, their orbital altitudes differ, creating differences in the Ground Sample Distance (GSD). KOMPSAT-3 orbits 625 km above sea level and captures images at 0.70 m GSD, whereas KOMPSAT-3A operates at an altitude of 528 km with a GSD of 0.55 m (Committee on Earth Observation Satellites, 2024). Table 1 summarizes the specifications of the KOMPSAT-3/3A satellites.

Table 1 Specification of KOMPSAT-3/3A satellites (Committee on Earth Observation Satellites, 2024)

Parameter | KOMPSAT-3 | KOMPSAT-3A
Launch date | May 18, 2012 | March 25, 2015
Ground sample distance | 0.70 m | 0.55 m
Altitude | 625 km | 528 km
Inclination | 98.13° | 97.513°
Orbital period | 98.5 minutes | 95.2 minutes
Number of revolutions | 14.6 revolutions/day | 15.1 revolutions/day
Repeated ground track | 28 days / 423 revolutions | 28 days / 409 revolutions


In this study, the AI-Hub Satellite Image Building Boundary Detection Dataset version 1.0, which is composed of KOMPSAT-3/3A images, was used. The dataset comprises data acquired from images taken in four cities (Los Angeles in the USA, Shanghai in China, Wolfsburg in Germany, and New Cairo in Egypt) and captures a variety of building forms by country and continent in both raster and polygon formats. It provides the building coordinates in both the longitude-latitude coordinate system and image pixel coordinates. The type of building is expressed by color in the label image; however, we did not utilize this information in this study. The dataset features over 150,000 buildings in 1,238 tiles of officially divided training data and 50,000 buildings in 159 tiles of validation data. Each tile is 1,024 × 1,024 pixels at 0.55–0.70 m resolution. The whole tile was used without cropping, since SAM's standard input size is 1,024 × 1,024. We used the training data for training and validation and the validation data for testing. Fig. 4 shows an example of an original image and the corresponding label image from the dataset.

Fig. 4. Sample image of the dataset. (a) shows the original train image, and (b) shows the corresponding label image. The label image is expressed in five colors for each type of building; however, in this study, we do not utilize the types of buildings.

3.2. Hyperparameter Selection

To identify the optimal hyperparameter combination for SAM, we conducted 27 experimental cases for each scenario, with and without a learning rate scheduler. These experiments varied the optimizer type, learning rate, and batch size, as these hyperparameters directly influence the training dynamics. The remaining training conditions, such as the loss function and the number of epochs, were kept constant, since they either define the static objective of the training or do not significantly impact the training process. Each optimizer was tested with three learning rates (5 × 10⁻⁸, 1 × 10⁻⁷, and 2 × 10⁻⁷) and three batch sizes (4, 8, and 12), resulting in nine combinations per optimizer. The maximum batch size was determined by the GPU used, an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory; a batch size of 12 was the largest for which the GPU could hold the whole model and the embeddings. Smaller batch sizes of 1 or 2 were avoided due to the risk of overfitting. The learning rate candidates were chosen empirically based on several experimental trials. The test cases are listed in Table 2. When the learning rate scheduler was used, the StepLR scheduler reduced the learning rate by half every 2 epochs. When applying the Adam optimizer, an L2 regularization weight decay of 0.01 was chosen to prevent overfitting by adding a penalty term to the loss function. The same weight decay of 0.01 was applied to AdamW, while the default setting was used for Lion.

Table 2 Test case number for combinations of learning rate and batch size

Optimizers: Adam, AdamW, Lion | Scheduler: StepLR

Batch size | Learning rate 5 × 10⁻⁸ | Learning rate 1 × 10⁻⁷ | Learning rate 2 × 10⁻⁷
4 | Case 1 | Case 2 | Case 3
8 | Case 4 | Case 5 | Case 6
12 | Case 7 | Case 8 | Case 9

The case number applies identically for every optimizer tested.
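The grid in Table 2 and the scheduler setting described above can be expressed as a short configuration sketch. Names are illustrative; Lion is not part of PyTorch, so its construction is only indicated, not implemented.

```python
# Sketch of the experimental grid in Table 2: nine (learning rate, batch size)
# cases per optimizer, with StepLR halving the learning rate every 2 epochs.
import itertools
import torch

learning_rates = [5e-8, 1e-7, 2e-7]
batch_sizes = [4, 8, 12]

cases = {}
for idx, (bs, lr) in enumerate(itertools.product(batch_sizes, learning_rates), start=1):
    cases[f"Case {idx}"] = {"batch_size": bs, "lr": lr}

def build_optimizer(name, params, lr):
    if name == "Adam":
        return torch.optim.Adam(params, lr=lr, weight_decay=0.01)   # L2 penalty
    if name == "AdamW":
        return torch.optim.AdamW(params, lr=lr, weight_decay=0.01)  # decoupled decay
    raise ValueError("Lion requires an external implementation (e.g. lion-pytorch)")

# optimizer = build_optimizer("AdamW", trainable_params, cases["Case 6"]["lr"])
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
```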



Other hyperparameters, such as the loss function and the number of epochs, were fixed across experiments to focus the analysis solely on the dynamic hyperparameters that influence the training process. Binary cross entropy (BCE) with logits loss was used to compute the loss and update the parameters in the training stage. It combines a sigmoid layer and the BCE loss in a single operation. BCE loss is the most commonly used loss function for semantic segmentation, and the with-logits variant computes the loss with better numerical stability than plain BCE loss by applying the log-sum-exp trick. The predicted probability is transformed to a value between 0 and 1 by the sigmoid layer, and a threshold of 0.5 was used to determine whether each pixel belongs to a building. The unreduced loss can be defined as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right]$$

where $y_i$ and $p_i$ denote the label and the predicted probability of pixel $i$, respectively, and $N$ is the number of pixels in the minibatch.
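In PyTorch terms, the loss and thresholding described above correspond to the following sketch; tensor shapes are placeholders.

```python
# Sketch of the loss and thresholding: BCE-with-logits for training and a
# 0.5 probability threshold for the binary building mask.
import torch

criterion = torch.nn.BCEWithLogitsLoss()             # sigmoid + BCE, computed stably

logits = torch.randn(1, 1, 256, 256)                 # placeholder mask logits
labels = torch.randint(0, 2, (1, 1, 256, 256)).float()

loss = criterion(logits, labels)                      # averaged over minibatch pixels
pred_mask = torch.sigmoid(logits) > 0.5               # building / background decision
```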

Using the Case 9 setting with each optimizer, we trained the SAM model for 100 epochs to observe the trend in the loss value. As shown in Fig. 5, the training loss reaches its minimum value before epoch 10 and does not decrease further. In some instances, after reaching this minimum, the loss remains static or increases slightly. Based on this observation, 10 epochs are sufficient, and increasing the number of epochs does not improve model performance. Therefore, 10 epochs were used consistently throughout all experiments.

Fig. 5. Prior test with 100 epochs. Case 9 setting was used for all optimizer types. We can observe that the loss value converges in an early epoch in every test case.

3.3. Implementation Details

As mentioned earlier, SAM segments a single specified object at a time, so its output contains a single object for the whole image. To match this form, we created a single-object label image from the ground truth data and used it to measure the loss. Fig. 6 shows an example of the label image preprocessed separately for every building segment. Furthermore, an erosion kernel was applied to avoid collisions between building masks, as the white boundary in the original label image appears larger than the actual building boundary (see Fig. 6), leading to overlaps among the building masks.

Fig. 6. Visualization of the label data preprocessing for fine-tuning. (a) and (b) display the original satellite image and its corresponding label image. (c) illustrates the preprocessed individual segment label images, where the example image contains 259 buildings, resulting in 259 individual label images prepared for fine-tuning. (d) shows the merged label image, combining all individual labels into a single image. The yellow boxes highlight the magnified portions, offering a closer view of the overlap between labels.
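A minimal sketch of this preprocessing, assuming the ground truth is available as an integer-labeled mask in which each building carries a unique id; the 3 × 3 erosion kernel is an illustrative choice, not the exact kernel used.

```python
# Sketch of the label preprocessing in Section 3.3: split the ground truth into
# one binary label per building and erode each label slightly so that adjacent
# building masks do not overlap.
import cv2
import numpy as np

def per_building_labels(label_mask: np.ndarray, kernel_size: int = 3):
    """label_mask: integer array where each building has a unique positive id."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    labels = []
    for building_id in np.unique(label_mask):
        if building_id == 0:                          # 0 is background
            continue
        single = (label_mask == building_id).astype(np.uint8)
        labels.append(cv2.erode(single, kernel, iterations=1))
    return labels
```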

3.4. Evaluation Metrics

F1 Score, Intersection over Union (IoU), and mean IoU (mIoU) were used to measure the performance of the trained models along with Precision, Recall, and Accuracy. These metrics are widely used in semantic segmentation tasks, and they are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) denote the number of pixels classified as true positive, true negative, false positive, and false negative, respectively. Building and background pixels are treated as positive and negative, respectively. IoU was measured on the whole-patch prediction, in which every building included in the image was merged into a single mask, whereas mIoU was measured for each building segment prediction separately and then averaged. We defined the best hyperparameter set for SAM by comparing the preceding metrics.
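The metrics can be computed from binary masks as in the following sketch; IoU is evaluated on the merged whole-patch prediction, while mIoU averages per-building IoUs. Function names are illustrative.

```python
# Sketch of the evaluation metrics. Inputs are boolean numpy arrays.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def f1_score(pred: np.ndarray, gt: np.ndarray) -> float:
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0

def mean_iou(per_building_preds, per_building_gts) -> float:
    # mIoU: average the IoU over individual building predictions.
    return float(np.mean([iou(p, g) for p, g in zip(per_building_preds, per_building_gts)]))
```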

4.1. Training Loss Stability

To examine the training stability and convergence speed, we compared the loss value during training. Fig. 7 shows the mean epoch loss value of all hyperparameter setting cases for each optimizer type used. The epoch numbers are denoted from 0 to 9, and for each case, the mean epoch loss is represented by a solid line when the learning rate scheduler is not used, and a dotted line when the scheduler is applied. The same case setting is represented by the same color.

Fig. 7. The mean epoch loss value for each training case is shown for the Adam, AdamW, and Lion optimizers, respectively. Cases that do not use the learning rate scheduler are represented with solid lines, while those that do use the scheduler are shown with dotted lines. The same set of cases is represented by the same color. When using AdamW, the training loss converges to a lower value and the process is stable. However, for the Adam and Lion optimizers, the training loss diverges to higher values in most cases without a scheduler, and even in some cases with a scheduler, indicating that the training process was not stable when using Adam and Lion.

Using Adam or Lion as the optimizer without the learning rate scheduler caused a significantly unstable training process, with the loss growing even higher than its initial value. Cases with AdamW did not show this problem, and the training loss converged to a lower value in all cases. The final convergence loss values also differed: with AdamW, all cases converged to values between 0.0004 and 0.0007, lower than any case with the other optimizers. The minimum loss for both Adam and Lion was 0.0011, reached in Case 7 without the learning rate scheduler.

The learning rate scheduler had a crucial effect on training stability, especially for Adam and Lion. Without a scheduler, the loss diverged more steeply; Cases 2, 3, 6, and 9, which share a larger learning rate and a smaller batch size, exhibited this tendency most strongly. Still, all cases initially succeeded in decreasing the loss to a minimum value of around 0.0012. However, even when using the scheduler, some cases failed to converge to this minimum value. Therefore, model performance can vary regardless of whether the training loss converges.

When using the AdamW optimizer, the training process was largely unaffected by changes in learning rate and batch size. The only difference observed was in convergence speed; with a larger learning rate and smaller batch size, the number of iterations increased, leading to faster convergence. Furthermore, when a scheduler was used, the difference in trend diminished. The Adam and Lion cases showed a similar trend with changes in learning rate and batch size: a larger batch size and smaller learning rate made training more stable, while the opposite setting (smaller batch size and larger learning rate) tended to introduce more variability and instability during training. The batch size had the greater impact, likely because it determines the number of iterations.

4.2. Mask Prediction Analysis

To analyze the performance of the trained models, we evaluated them on the test dataset using both the final-epoch model and the model with the lowest mean epoch loss. Both quantitative and qualitative results were evaluated to determine the best hyperparameter combination.

4.2.1. Quantitative Results

Quantitative comparisons were conducted, as shown in Fig. 8. For each optimizer, the metric scores of the best-performing cases are presented in Table 3, with the highest score in each column marked. From Fig. 8, it can be observed that all trained cases achieve better F1 Scores than the original SAM, marked with a red dotted line.

Fig. 8. The first row shows the results with a scheduler, while the second row shows the results without one. From left to right, the bar graphs display the F1 Score, IoU, and mIoU metrics. The red dotted line indicates the original SAM model’s metric. Regardless of scheduler use, AdamW consistently outperformed the original SAM model, yielding better results. Additionally, when using a scheduler with AdamW, the performance metrics were more consistent.

Table 3 Quantitative metric score results (the best value in each column is marked with an asterisk)

Optimizer | Scheduler | Case no. | F1 Score | IoU | mIoU | Precision | Recall | Accuracy
Original SAM | - | - | 0.72763 | 0.58943 | 0.65720 | 0.60019 | 0.97398 | 0.89239
Adam | O | 9 | 0.75148 | 0.62326 | 0.74637 | 0.64869 | 0.93911 | 0.90998
Adam | X | 8 | 0.75644 | 0.62366 | 0.74782 | 0.64801 | 0.94688 | 0.91008
AdamW | O | 6 | 0.77668 | 0.65535* | 0.86478* | 0.66284* | 0.97814 | 0.91588*
AdamW | X | 6 | 0.77835* | 0.65148 | 0.84112 | 0.65961 | 0.98136* | 0.91581
Lion | O | 9 | 0.75164 | 0.62356 | 0.74719 | 0.64842 | 0.94005 | 0.91008
Lion | X | 8 | 0.75630 | 0.62345 | 0.74776 | 0.64789 | 0.94674 | 0.90998


All the best metric values were achieved by cases trained with AdamW. With AdamW, the trained model improved the F1 Score by roughly 0.05 regardless of whether the scheduler was used. When a scheduler was used, the metric values were very closely clustered, which can be interpreted as indicating more consistent training. For Adam and Lion, the increase in the F1 Score depended on the use of a scheduler: approximately 0.024 with a scheduler and about 0.029 without it.

The influence of batch size was consistent across all cases, with larger batch sizes resulting in better performance. However, the impact of changes in the learning rate differed between AdamW and the other two optimizers. For AdamW, larger learning rates led to better performance, whereas for Adam and Lion, smaller learning rates produced better results. The best F1 Score for Adam was 0.7564 in Case 8, and the worst was 0.74016 in Case 3, both without a learning rate scheduler. Similarly, for Lion, the best F1 Score was 0.7563 in Case 8 and the worst was 0.7438 in Case 3, also without a scheduler.

4.2.2. Qualitative Results

The original SAM model generated masks that were larger than the actual buildings, resulting in overlapping areas between masks. This creates ambiguous boundary areas between segments, so the exact boundary of a building cannot be determined. The trained models exhibited this characteristic to a lesser extent than the original SAM, allowing them to better distinguish nearby building segments. Fig. 9 shows the mask generation results from all 27 models trained with the learning rate scheduler, and Fig. 10 shows selected mask generation results from the predictions on the 159 validation images. The trained models generated increasingly distinct masks between objects in the order of Adam, Lion, and AdamW. This is consistent with the observation that the recall metric for AdamW is higher.

Fig. 9. Mask results from all cases using the learning rate scheduler are displayed. The first column shows the original image, label image, and mask predicted by the original SAM model. The mask results from each trained model are arranged in a 3 x 3 grid for each optimizer type.
Fig. 10. Five example images of the mask prediction results are shown. The first column displays the original test image from the dataset, while the second column presents the preprocessed label image. The third column shows the mask result from the original SAM model. The remaining columns depict the prediction results from the best-trained model for each optimizer type. In all cases, each building segment was predicted separately and then merged into a single image. The model trained with the AdamW optimizer tends to predict a more accurate form of the building mask. However, models trained with Adam or Lion were able to predict masks for more densely placed buildings.

As shown in Table 3, both Adam and Lion achieved their highest performance in Case 9 when using a scheduler and in Case 8 without one. In contrast, AdamW showed its best performance in Case 6. However, for AdamW, the variation between cases was minimal, suggesting that selecting a specific case as the best based on the evaluation metrics does not hold significant value.

For Adam and Lion, the impact of learning rate and batch size was more pronounced. As the learning rate increased, prediction stability decreased, particularly affecting boundary delineation and degrading performance. Boundaries became irregular, and grid-like dots appeared inside the masks. This phenomenon can be attributed to the original SAM's tendency to predict larger objects, a key difference between natural and satellite images. During training, SAM predicts uniquely shaped building segments over much larger areas, resulting in high losses and large parameter updates when predicting smaller buildings.

As a result, the model degraded significantly and failed to predict smaller buildings, creating grid-patterned gaps in the masks. Additionally, high learning rates caused unstable boundary shapes, producing overly complex masks. Therefore, a lower learning rate appears preferable for fine-tuning SAM, as it helps mitigate these issues. With a lower learning rate, parameter updates remain small even when mask prediction fails sharply, preventing the model from deteriorating, although training takes longer to converge. With a higher learning rate, trained models performed better in predicting uniquely shaped buildings but failed to predict larger and less distinctively shaped buildings. As a result, the lower learning rate of 5 × 10⁻⁸ yielded better results.

Cases with larger batch sizes produced more accurate mask predictions than those with smaller batch sizes. When predicting images with densely placed buildings, cases with smaller batch sizes tended to predict larger masks, causing building masks to collide and appear as a single structure, similar to the original SAM. Training with a larger batch size partially mitigated this issue. However, models trained with AdamW were not able to overcome this problem.

In the prediction results of both the original model and all trained models, patterned gaps appeared within the building masks. Examples of these inaccuracies are shown in Fig. 11. They not only lower the metrics but also produce overly complex building boundaries. The degree of this effect varied with the hyperparameter combination used. With AdamW as the optimizer, the patterns occurred less frequently. However, for Adam and Lion, a significant increase in both the frequency and the area of pattern occurrences was observed in specific cases. Cases 2, 3, and 6 had a higher rate of such errors, with Case 3 in particular showing that the mask generation process itself did not proceed correctly.

Fig. 11. Example of patterned inaccuracies appearing within the masks from (a) original model, (b) Adam Case 2, and (c) Lion Case 2, respectively. (c) shows the unexpected additional pattern outside the building area.

4.3. Effect of Image Scale

We tested three cases of simple upscaling of the image and three cases of kinetic cropping with different buffer sizes. The evaluation metric scores are shown in Figs. 12 and 13, and examples of the mask predictions are shown in Figs. 14 and 15.

Fig. 12. F1 Score and IoU metrics measured when simple upscaling is applied are shown. In all cases, 2x upscaling showed the highest performance, with the models ranking in order of AdamW, Lion, Adam, and the original model. When upscaling was applied, the original model showed the largest performance improvement.
Fig. 13. F1 Score and IoU metrics measured when kinetic scaling was applied are shown. Each object was cropped with a selected buffer area. For the trained model, the AdamW optimizer with Case 6 was used for mask prediction. Metrics without scaling are indicated by different colored dotted lines for each model. When upscaling was applied, the original model showed a larger improvement in performance.
Fig. 14. The images below show the masks predicted by each model when using a grid-cut, simply upscaled image as input. From the first row, the results are shown for the original, 2x, 4x, and 8x upscaled images, respectively. From left to right, the results were predicted using the best-performing models trained with the original SAM, Adam, AdamW, and Lion optimizers. It was observed that occasional errors occurred in the regions where the predictions were interrupted by grid splits. When upscaled by 2x, the results showed that even in densely packed building areas, individual objects could be clearly distinguished.
Fig. 15. Mask prediction from three example images when cropping and buffering are applied. In the first row, the original image and label image are displayed. The second row shows the results of mask prediction without any preprocessing of the image. Starting from the third row, the results are shown in sequence for mask predictions where the image was cropped with buffers of 1 pixel, 50 pixels, and 100 pixels, respectively, before being used as input. In the odd-numbered columns, the results are from the original SAM model, while the even-numbered columns show the results from the model trained using the AdamW and Case 6 settings. When a narrow buffer was applied, it was observed that the original model struggled to make accurate predictions. In contrast, the trained model was still able to make predictions even in this case, and with a 100-pixel buffer, it was capable of generating highly accurate masks.

To examine the relationship between SAM's performance and object size, we adopted an approach similar to Ren et al. (2024). We first evaluated the segmentation performance on images that had been upscaled by factors of 2, 4, and 8. All trained models were evaluated using the upscaled test images. Among the different optimizer types, Case 6 demonstrated the highest performance with AdamW, while Case 9 achieved the best performance with either Adam or Lion. A comparison was made among the models that achieved the highest performance for each optimizer. When the images were upscaled by factors of 2 and 4, both metrics showed improved performance compared to the original scale across all three trained models.

However, when upscaled by a factor of 8, performance noticeably decreased. For the original model, the F1 Score improved to 0.7641 from 0.7276 (an increase of 0.0365) when the image was upscaled by a factor of 2, while the IoU increased significantly to 0.6393 from 0.5894 (an increase of 0.0499). Similarly, for the model trained using AdamW, the F1 Score increased to 0.7909 from 0.7767, and the IoU rose to 0.6751 from 0.6553, outperforming all previously trained models. Across all optimizers, the best performance was observed when the image scale was between 2x and 4x.

As the image scale increases, the model tends to misclassify nonbuilding areas. Additionally, as shown in Fig. 12, the prediction performance along the boundaries deteriorated during the process of reassembling the images that had been divided into grids. When upscaling was applied, it was observed that the performance improvement of the trained model was more limited compared to the original model.

Next, we evaluated the original model and the model trained with AdamW under the Case 6 setting while varying the buffer size, which determines how much of the surrounding area is included with each building in the input. Three buffer sizes (1 pixel, 50 pixels, and 100 pixels) were assessed. Both the original model and the AdamW-trained model showed progressively better performance as the buffer increased from 1 to 50 to 100 pixels around the objects. However, similar to the simple upscaling results, the performance gap between the trained model and the original model decreased as the buffer size increased. With kinetic cropping, the F1 Score of the AdamW-trained Case 6 model increased from 0.7515 to 0.8694, an improvement of 0.1179. For the original SAM model, the F1 Score improved from 0.7276 to 0.8499, and the IoU increased from 0.5894 to 0.7438.

Visual comparisons also demonstrated an improvement in mask generation performance. When predicting areas with densely located buildings, the cropped inputs led to more accurate extraction of building boundaries than using the full image, enabling the mask outputs to be generated without overlapping. However, the patterned inaccuracies tended to appear more often in the trained models. Figs. 14 and 15 illustrate the prediction masks generated with the various upscaling methods applied.

In this study, we optimized the application of SAM, specializing it for satellite images, through two primary approaches, given that SAM is predominantly pretrained on natural images. First, optimizing the hyperparameters allows SAM to improve segmentation quality on satellite images, specifically those from the KOMPSAT dataset in this study. When training SAM, the occasional prediction of masks containing gaps should be considered, and it is crucial to develop strategies that minimize the occurrence of this phenomenon while enhancing building segmentation performance. Second, we applied SAM to images that were cropped object by object and buffered to further enhance performance.

Initially, we conducted a comparative analysis of the loss variation during training and the predictive performance of the trained models under different hyperparameter combinations. Performance improved most significantly with the AdamW optimizer: the F1 Score increased from 0.7276 to 0.7767, a rise of 0.0491, while the IoU improved from 0.5894 to 0.6554, a gain of 0.066. With AdamW, there were no significant differences related to learning rate and batch size. However, even with a large batch size and high learning rate, stable and fast training was achieved when a learning rate scheduler was applied. Additionally, it was confirmed that, following the proposed hyperparameter settings, training could be conducted without increasing the occurrence of inaccuracies.

To further examine the impact of image scaling, we compared performance when images were upscaled by factors of 2, 4, and 8, as well as when kinetic cropping and buffering were applied to individual objects. The results showed that applying kinetic cropping with a 100-pixel buffer to the model trained with AdamW led to a significant increase in the F1 Score, which rose to 0.8694 (an improvement of 0.1179 over the corresponding result without preprocessing). Visual inspection confirmed that buildings in densely built-up images could be extracted without overlapping masks.

Despite these advancements, this study has certain limitations. There are still instances where SAM creates inaccuracies within the mask during mask generation; this issue could be addressed through the application of HQ-SAM (Ke et al., 2023). Additionally, the loss value decreased rapidly within a single epoch. Adjusting the dataset size would allow the loss value to either continue decreasing or remain stable over more epochs. Moreover, removing items from the dataset that significantly reduce the learning effect is likely to enhance the effectiveness of training. Further research could explore improving the learning effect through boundary-aware learning during training, considering the orientation of building segments and adjusting the image orientation accordingly.

This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00155763), and Korea Ministry of Land, Infrastructure and Transport (MOLIT) as 「Innovative Talent Education Program for Smart City」. The Institute of Engineering Research at Seoul National University provided research facilities for this work. This research (paper) used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through ‘AI-Hub (www.aihub.or.kr)’.

  1. Acharya, T. D., Yang, I. T., and Lee, D. H., 2016. Land cover classification using a KOMPSAT-3A multi-spectral satellite image. Applied Sciences, 6(11), 371. https://doi.org/10.3390/app6110371
  2. Badrinarayanan, V., Handa, A., and Cipolla, R., 2015. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293. https://doi.org/10.48550/arXiv.1505.07293
  3. Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., and Zou, Z., et al, 2024a. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing, 62, 4701117. https://doi.org/10.1109/TGRS.2024.3356074
  4. Chen, X., Liang, C., Huang, D., Real, E., Wang, K., and Pham, H., et al, 2024b. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675. https://doi.org/10.48550/arXiv.2302.06675
  5. Committee on Earth Observation Satellites, 2024. CEOS EO handbook - Mission summary. Available online: https://database.eohandbook.com/database/missionsummary.aspx?missionID=698&utm_source=eoportal&utm_content=kompsat-3a (accessed on July 10, 2024)
  6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., and Unterthiner, T., et al, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929
  7. Guo, Y., Liu, Y., Georgiou, T., and Lew, M. S., 2018. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7, 87-93. https://doi.org/10.1007/s13735-017-0141-z
  8. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., and Downey, D., et al, 2020. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. https://doi.org/10.48550/arXiv.2004.10964
  9. Huang, Z., Jing, H., Liu, Y., Yang, X., Wang, Z., and Liu, X., et al, 2024. Segment anything model combined with multi-scale segmentation for extracting complex cultivated land parcels in high-resolution remote sensing images. Remote Sensing, 16(18), 3489. https://doi.org/10.3390/rs16183489
  10. Ke, L., Ye, M., Danelljan, M., Tai, Y. W., Tang, C. K., and Yu, F., 2023. Segment anything in high quality. arXiv preprint arXiv:2306.01567. https://doi.org/10.48550/arXiv.2306.01567
  11. Khatua, A., Bhattacharya, A., Goswami, A. K., and Aithal, B. H., 2024. Developing approaches in building classification and extraction with synergy of YOLOV8 and SAM models. Spatial Information Research, 32, 511-530. https://doi.org/10.1007/s41324-024-00574-0
  12. Kim, S., and Kim, T., 2023. Automated extraction of orthorectified building layer from high-resolution satellite images. Korean Journal of Remote Sensing, 39(3), 339-353. https://doi.org/10.7780/kjrs.2023.39.3.7
  13. Kingma, D. P., and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://doi.org/10.48550/arXiv.1412.6980
  14. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., and Gustafson, L., et al, 2023. Segment anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, Oct. 1-6, pp. 4015-4026. https://doi.org/10.1109/iccv51070.2023.00371
  15. Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. https://doi.org/10.1145/3065386
  16. Long, J., Shelhamer, E., and Darrell, T., 2016. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1605.06211. https://doi.org/10.48550/arXiv.1605.06211
  17. Loshchilov, I., and Hutter, F., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. https://doi.org/10.48550/arXiv.1711.05101
  18. Mazurowski, M. A., Dong, H., Gu, H., Yang, J., Konz, N., and Zhang, Y., 2023. Segment anything model for medical image analysis: An experimental study. Medical Image Analysis, 89, 102918. https://doi.org/10.1016/j.media.2023.102918
  19. Ren, S., Luzi, F., Lahrichi, S., Kassaw, K., Collins, L. M., and Bradbury, K., et al, 2024. Segment anything, from space?. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, Jan. 3-8, pp. 8355-8365. https://doi.org/10.1109/wacv57701.2024.00817
  20. Ronneberger, O., Fischer, P., and Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A., (eds.), Medical image computing and computer-assisted intervention - MICCAI 2015, Springer, pp. 234-241. https://doi.org/10.1007/978-3-319-24574-4_28
  21. Sun, K., Xiao, B., Liu, D., and Wang, J., 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 15-20, pp. 5693-5703. https://doi.org/10.1109/cvpr.2019.00584
  22. Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., and Singhal, U., et al, 2020. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. https://doi.org/10.48550/arXiv.2006.10739
  23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., and Gomez, A. N., et al, 2017. Attention is all you need. arXiv preprint arXiv:1706.03762. https://doi.org/10.48550/arXiv.1706.03762
  24. Verde, N., Mallinis, G., Tsakiri-Strati, M., Georgiadis, C., and Patias, P., 2018. Assessment of radiometric resolution impact on remote sensing data classification accuracy. Remote Sensing, 10(8), 1267. https://doi.org/10.3390/rs10081267
  25. Wang, L., Fang, S., Meng, X., and Li, R., 2022a. Building extraction with vision transformer. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11. https://doi.org/10.1109/TGRS.2022.3186634
  26. Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., and Meng, X., et al, 2022b. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 196-214. https://doi.org/10.1016/j.isprsjprs.2022.06.008
  27. Xu, Y., Wu, L., Xie, Z., and Chen, Z., 2018. Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sensing, 10(1), 144. https://doi.org/10.3390/rs10010144
  28. Zheng, D., Li, S., Fang, F., Zhang, J., Feng, Y., and Wan, B., et al, 2023. Utilizing bounding box annotations for weakly supervised building extraction from remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-17. https://doi.org/10.1109/tgrs.2023.3271986
  29. Zhou, X., Liang, F., Chen, L., Liu, H., Song, Q., and Vivone, G., et al, 2024. MeSAM: Multiscale enhanced segment anything model for optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62, 5623515. https://doi.org/10.1109/tgrs.2024.3398038

Research Article

Korean J. Remote Sens. 2024; 40(5): 551-568

Published online October 31, 2024 https://doi.org/10.7780/kjrs.2024.40.5.1.11

Copyright © Korean Society of Remote Sensing.

Optimal Hyperparameter Analysis of Segment Anything Model for Building Extraction Using KOMPSAT-3/3A Images

Donghyeon Lee1 , Jiyong Kim2 , Yongil Kim3*

1Master Student, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea
2Researcher, Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
3Professor, Department of Civil and Environmental Engineering, Seoul National University, Seoul, Republic of Korea

Correspondence to:Yongil Kim
E-mail: yik@snu.ac.kr

Received: October 7, 2024; Revised: October 24, 2024; Accepted: October 24, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Extracting building information from Very-High-Resolution (VHR) satellite images is critical for urban mapping and monitoring. Traditional manual annotation methods are labor-intensive and costly, making automated solutions highly desirable. Segment Anything Model (SAM), a foundation model trained mostly on natural images, has recently shown high performance on diverse segmentation tasks. However, due to differences in perspective and the average size of objects in the images, SAM exhibits lower performance when extracting buildings from satellite imagery. These limitations, derived from differences in image domains, can be addressed by fine-tuning the model with satellite images and preprocessing the input images. However, various hyperparameters, such as learning rate, batch size, and optimizer type, deeply impact the performance of the fine-tuned model, and thus, in-depth investigations on these hyperparameters are critical for model adaptation. To identify the optimal hyperparameter configuration, we conducted extensive experiments with combinations of hyperparameter settings using Korea Multi-Purpose Satellite (KOMPSAT) images. Additionally, various upscaling methods and object-by-object preprocessing techniques were compared and evaluated, leading to the proposal of an effective preprocessing approach. With the optimal combination, an F1 Score of 0.862, an Intersection over Union (IoU) of 0.761, and a mean IoU (mIoU) of 0.705 were achieved using AdamW optimizer, object-by-object cropping, and 100-pixel buffering. The proposed hyperparameter optimization method in our research underscores the effectiveness of fine-tuning SAM for accurate building extraction in VHR satellite imagery, thereby enabling more reliable data interpretation and decision-making processes in automated remote sensing applications.

Keywords: Building extraction, Segment anything model, Hyperparameter optimization, KOMPSAT

1. Introduction

Extracting accurate spatial information on buildings is critical in urban planning, disaster monitoring, agricultural production prediction, and population estimation. Furthermore, the increase in very high-resolution satellite imagery has opened up new opportunities for extracting accurate building information. Accordingly, semantic segmentation (Xu et al., 2018), which involves annotating an image at the pixel level, is considered one of the most important subtasks in the analysis of building spatial information from optical satellite imageries.

Manual annotation of buildings requires an excessive amount of human intervention, which limits its practicality and scalability (Zheng et al., 2023). Moreover, the complex scenes that appear in satellite images, including the diversity of building types, shapes, sizes, and colors, greatly limit automated building extraction. Furthermore, buildings are usually located close to other buildings or roads, which makes it difficult for machines to distinguish the boundaries of adjacent objects. Considering the ability of Deep Neural Networks (DNNs) to incorporate both shallow and deep features into segmentation, the use of deep learning methods for automated building extraction has been investigated.

Convolutional Neural Networks (CNNs), which add convolution layers to DNNs to handle two-dimensional image data, have demonstrated exceptional ability in building segmentation (Krizhevsky et al., 2017). Fully Convolutional Networks (FCNs), an extension of CNNs, were subsequently proposed to handle arbitrary-sized images (Long et al., 2016). However, FCNs generate coarse-level building predictions, and two approaches were proposed to overcome this problem. The first uses U-shaped encoder-decoder structures, such as U-Net (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2015), which construct decoders symmetrical to their encoders. The encoder captures features from the image and represents them as high-dimensional features, and the decoder restores fine-resolution output from both high-dimensional and low-dimensional features by using skip connections. The second approach constructs multiscale networks, such as HRNet (Sun et al., 2019), which preserve high-resolution features by repeatedly fusing multiscale representations.

However, CNN-based models suffer from drawbacks such as a high computational burden for pixel-level classification (Guo et al., 2018) and limitations in modeling global context. In contrast, transformer architectures that rely on attention mechanisms (Vaswani et al., 2017) offer distinct advantages in capturing global information. Dosovitskiy et al. (2020) proposed the Vision Transformer (ViT) to adapt the self-attention mechanism to computer vision, and several researchers have since applied transformer-based models to semantic segmentation. For instance, Wang et al. (2022b) employed a transformer architecture to develop a decoder, and Wang et al. (2022a) implemented a multipath network that extracts global context using ViT and spatial context through a CNN.

Unlike CNN-based models, transformer-based models require extensive datasets for training, which are often not readily available in the remote sensing domain. Consequently, transformer-based models are typically pretrained and then fine-tuned for specific applications. These pretrained models, referred to as foundation models, are trained on large datasets comprising a substantial number of images to enhance their generalization capabilities. The Segment Anything Model (SAM), one such large-scale vision foundation model, has been widely adopted and exhibits cutting-edge zero-shot capabilities in diverse settings by exploiting prompts (Kirillov et al., 2023). SAM is made up of three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the SAM structure. Each part of SAM was pretrained using the SA-1B dataset, which contains more than 1 billion masks (Kirillov et al., 2023). Only a small portion of the dataset consists of satellite images, so SAM is not tuned to their unique geometric properties.

Figure 1. The structure of SAM proposed by Kirillov et al. (2023). The image encoder represents the image in high-level feature form, and the prompt encoder translates various types of prompts into a form that the SAM can understand. The mask decoder then uses features from encoders to predict the mask.

The unique perspective of satellite images introduces geometric differences that limit the straightforward application of State-of-the-Art (SOTA) models to this type of data (Verde et al., 2018; Chen et al., 2024a). Additionally, since satellite images capture large areas, buildings often appear small within the images, and SAM is known to have a bias regarding the size of the objects it extracts (Ren et al., 2024). Preprocessing the input images before applying SAM is therefore expected to improve extraction performance. Together, these characteristics related to perspective and object scale necessitate the optimization of hyperparameters and image preprocessing to achieve better performance.

Various researchers have attempted to improve the performance of SAM on satellite images. Chen et al. (2024a) proposed RSPrompter, a prompt-learning approach that adapts suitable prompts for remote sensing images. In terms of fully automated SAM utilization, Khatua et al. (2024) applied the You Only Look Once (YOLOv8) object detection algorithm to generate prompts. However, these methods do not enhance the mask generation capacity itself, as SAM remains unmodified. Zhou et al. (2024) proposed two enhancements and introduced a fine-tuned version of SAM named MeSAM. First, they added adapter layers between the transformer layers in SAM's image encoder to preserve high-resolution spatial information in remote sensing images. Second, they restructured and fine-tuned the mask decoder with an additional dataset to accommodate more tokens. Nonetheless, adding new structures further compromises the already limited usability of SAM caused by its large image encoder. Regarding the general usage of SAM, Ke et al. (2023) identified the irregular patterns occasionally generated in its masks and added a high-quality output token to address the low-resolution problem. The occurrence of this phenomenon can be exacerbated by the selection of hyperparameters. Therefore, methods to optimize the training of SAM on satellite images should focus on identifying hyperparameter combinations that facilitate effective training while mitigating these issues. In this study, we first aim to determine the optimal hyperparameters for training SAM on Korea Multi-Purpose Satellite (KOMPSAT) images.

Huang et al. (2024) integrated multiscale segmentation with SAM to address the challenge of segmenting complex farmland parcels in high-resolution satellite imagery. To effectively extract parcels of varying sizes, they generated pre-segmentation results at multiple scales and used these results as prompts for SAM. Mazurowski et al. (2023) applied SAM to medical images containing objects of various sizes, confirming that SAM’s performance is proportional to the average object size in the images. However, this study focused solely on analyzing SAM’s applicability and did not propose any methods for enhancing its performance.

Based on a comprehensive analysis of previous research, we identified key challenges related to differences in image perspective, stability during the training process, and the varying sizes of target objects. Our objective is to determine the optimal hyperparameter combination for training SAM's mask decoder specifically on satellite images to address these issues; during training, the image encoder and prompt encoder remain frozen. In this study, SAM was trained on a specific satellite image dataset, namely KOMPSAT. Therefore, the objectives of this study are to optimize SAM's training hyperparameters and to propose a cropping and buffering method for KOMPSAT images.

2. Methods

2.1. SAM

SAM is a foundation model developed for general semantic segmentation tasks (Kirillov et al., 2023). It was trained on the SA-1B dataset, which was specifically created for SAM and contains over one billion masks. Since SAM is designed to be a robust model applicable to various types of images, the SA-1B dataset primarily consists of natural images, with only a small portion of satellite images. However, natural images differ geometrically from satellite images, leading to SAM's lower performance in building extraction tasks.

SAM is built from three parts: an image encoder, a prompt encoder, and a mask decoder. Fig. 1 shows the structure of SAM. Both the image encoder and the mask decoder are made up of vision transformer layers. The image encoder contains over 636 million parameters and creates the image embedding used by the mask decoder. The mask decoder is a lightweight structure compared to the image encoder, containing about 4 million parameters. For the same image, the embedding generated by the image encoder can be reused to predict masks of different objects. In contrast, the prompt encoder and mask decoder cannot handle batched input and must be executed separately for each set of prompts.

The prompt is an additional input that specifies the object to be segmented. Sparse prompts such as points, boxes, and text, and dense prompts such as masks, can be used with SAM when available. The prompt encoder represents point and box prompts using positional encoding (Tancik et al., 2020). Multiple types of prompts can be used simultaneously if they indicate the same object within the image. Bounding box prompts are known to give the most precise annotation for SAM in building extraction (Ren et al., 2024). The features from the image encoder and prompt encoder are then concatenated and processed by the mask decoder to produce the final mask result and a prediction score.
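To make this prompting workflow concrete, the following minimal sketch segments one building with a bounding box prompt using the publicly released segment_anything package; the checkpoint path, placeholder image, and box coordinates are illustrative and are not taken from this study's pipeline.

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Load a pretrained SAM checkpoint (the path refers to the public ViT-H checkpoint).
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # Placeholder tile; in practice this would be a 1,024 x 1,024 RGB satellite patch.
    image = np.zeros((1024, 1024, 3), dtype=np.uint8)
    predictor.set_image(image)  # computes the reusable image embedding

    # A single bounding box prompt in (x_min, y_min, x_max, y_max) pixel order.
    box = np.array([120, 80, 210, 160])
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    building_mask = masks[0]  # boolean mask of the prompted building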

2.2. Input Prompt Preprocessing

Since SAM requires a prompt to segment a specific object, we preprocessed the building pixel coordinates in the label data to create a bounding box prompt for each building segment. We first calculated the maximum and minimum values of the building pixel coordinates to obtain the exact bounding box. To facilitate the generalized application of SAM, we then applied a buffer area to the bounding boxes when creating prompt inputs, for two reasons. First, in most applications of SAM, the exact bounding box coordinates will not be available. Second, in the dataset, the created label data have smaller dimensions than the buildings shown in the image. Considering the spatial resolution of the image and the aforementioned reasons, a 10-pixel buffer was added when producing each bounding box prompt. Fig. 2 shows the preprocessing stage of creating a bounding box.

Figure 2. Process of producing bounding box prompt. (Step 1) The left image shows the label coordinates with the building segment. (Step 2) The middle image shows the exact bounding box created by calculating maximum and minimum coordinates. (Step 3) The right shows the final bounding box with a buffer area.
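A minimal sketch of this step is given below, assuming the building's label pixels are available as (row, column) coordinates; the function name and input layout are illustrative rather than the study's actual implementation.

    import numpy as np

    def building_box_prompt(building_pixels, image_size, buffer=10):
        """Create a buffered bounding box prompt from one building's label coordinates.

        building_pixels: (N, 2) array of (row, col) pixel coordinates of the building.
        image_size: (height, width) of the tile, used to clip the buffered box.
        """
        rows, cols = building_pixels[:, 0], building_pixels[:, 1]
        x_min, x_max = cols.min(), cols.max()
        y_min, y_max = rows.min(), rows.max()
        h, w = image_size
        # SAM expects boxes in (x_min, y_min, x_max, y_max) pixel order.
        return np.array([
            max(x_min - buffer, 0),
            max(y_min - buffer, 0),
            min(x_max + buffer, w - 1),
            min(y_max + buffer, h - 1),
        ])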

2.3. Input Image Preprocessing

Ren et al. (2024) assessed SAM on various satellite datasets, including Solar, 38-Cloud, DeepGlobe Roads, and SpaceNet 2, and additionally evaluated its performance on the DigitalGlobe Building and Inria Building datasets to focus on building extraction. To analyze the impact of object scale on performance, they upscaled the input images by factors of 2, 4, and 8 and concluded that SAM's performance can depend significantly on object scale. We applied a similar approach, which we refer to as the kinetic cropping strategy, to conduct an in-depth analysis of the relationship between performance and scale. The kinetic cropping strategy crops the image one-to-one for each bounding box prompt and uses the crop as input. Given the input image and a bounding box prompt, the crop area is determined by a selected number of buffer pixels. We tested three cases, with buffers of 1, 50, and 100 pixels. Bilinear interpolation was used for all resizing. Fig. 3 illustrates the cropping and buffering applied to the images.

Figure 3. Process of preprocessing inputs. (Step 1) Calculate the image crop area with select buffer pixels. (Step 2) Crop the image with the area calculated in Step 1 and transform the bounding box prompt to match the new coordinate system. The red box and purple box indicate the original bounding box and crop area, respectively. The output in the form on the far right was used as the model’s input.
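The sketch below illustrates one way this cropping and coordinate transformation could be implemented, assuming OpenCV for the bilinear resizing; the function name, the fixed 1,024-pixel target size, and the variable names are illustrative.

    import numpy as np
    import cv2  # used only for bilinear resizing

    def kinetic_crop(image, box, crop_buffer=100, target=1024):
        """Crop the tile one-to-one around a bounding box prompt and rescale both.

        image: H x W x 3 array; box: (x_min, y_min, x_max, y_max) in tile coordinates.
        Returns the resized crop and the prompt transformed to the crop's coordinates.
        """
        h, w = image.shape[:2]
        x0 = max(int(box[0]) - crop_buffer, 0)
        y0 = max(int(box[1]) - crop_buffer, 0)
        x1 = min(int(box[2]) + crop_buffer, w)
        y1 = min(int(box[3]) + crop_buffer, h)
        crop = cv2.resize(image[y0:y1, x0:x1], (target, target),
                          interpolation=cv2.INTER_LINEAR)
        sx, sy = target / (x1 - x0), target / (y1 - y0)
        new_box = np.array([(box[0] - x0) * sx, (box[1] - y0) * sy,
                            (box[2] - x0) * sx, (box[3] - y0) * sy])
        return crop, new_box

    # The predicted mask must later be resized back and pasted into the original tile.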

2.4. Fine-Tuning

We applied the adaptation method, a widely used approach for improving model performance in deep learning (Gururangan et al., 2020). During training, only the parameters of the mask decoder were left trainable, while the other two components of SAM's architecture were frozen. Four hyperparameters, namely the optimizer, learning rate, batch size, and learning rate scheduler, were tested to identify the optimal hyperparameter combination for building extraction with SAM.

For optimizers, Adaptive Moment Estimation (Adam; Kingma and Ba, 2014), Adam with decoupled Weight Decay (AdamW; Loshchilov and Hutter, 2017), and Evolved Sign Momentum (Lion; Chen et al., 2024b) were tested. Adam and AdamW are optimization algorithms that use adaptive learning rates, but they differ in how they apply weight decay. In Adam, weight decay is applied directly during the gradient update along with the learning rate, whereas in AdamW, the parameter update is performed first and weight decay is applied afterward. This decoupling allows AdamW to achieve more effective regularization. The Lion optimizer does not average the gradients but instead tracks the direction of the gradients to achieve faster convergence. Instead of applying weight decay directly, it maintains a more stable update trajectory, which helps mitigate overfitting. Lion is believed to exhibit advantageous performance, especially in large-scale models.

The learning rate controls how much the model’s parameters are updated in each iteration. A higher learning rate can help the model avoid getting trapped in a local minimum and move toward the global minimum, but it may also cause convergence difficulty. Conversely, a lower learning rate can lead to convergence at local minima, increase the risk of overfitting, and slow down the convergence process. Batch size similarly impacts the training dynamics. A smaller batch size allows the model to learn unique cases better but may also increase the risk of overfitting. Therefore, selecting the appropriate learning rate and batch size is crucial for optimal performance. This is especially important for fine-tuning foundation models, where improper tuning can degrade pre-trained features and lead to lower performance.
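A rough sketch of this adaptation setup in PyTorch is shown below: SAM's image encoder and prompt encoder are frozen so that only the mask decoder receives gradient updates, here with an AdamW optimizer; the checkpoint path and the particular learning rate are placeholders rather than the study's exact configuration.

    import torch
    from segment_anything import sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

    # Freeze the image encoder and prompt encoder; only the mask decoder is trained.
    for module in (sam.image_encoder, sam.prompt_encoder):
        for p in module.parameters():
            p.requires_grad = False

    # AdamW with decoupled weight decay of 0.01, one of the tested settings.
    optimizer = torch.optim.AdamW(
        sam.mask_decoder.parameters(), lr=2e-7, weight_decay=0.01
    )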

3. Experimental Design

3.1. Dataset

KOMPSAT series observation data enable independent Earth observation in Korea, and various studies have been conducted to utilize data from the KOMPSAT series effectively (Acharya et al., 2016; Kim and Kim, 2023). The KOMPSAT-3/3A sister satellites are used for VHR optical observation and are appropriate for building extraction. Both satellites use the same optical sensor, the only difference being the added infrared capability of KOMPSAT-3A. Moreover, their orbital altitudes differ, resulting in different Ground Sample Distances (GSDs): KOMPSAT-3 orbits 625 km above sea level and captures images with a 0.70 m GSD, whereas KOMPSAT-3A operates at an altitude of 528 km with a GSD of 0.55 m (Committee on Earth Observation Satellites, 2024). Table 1 summarizes the specifications of the KOMPSAT-3/3A satellites.

Table 1. Specifications of the KOMPSAT-3/3A satellites (Committee on Earth Observation Satellites, 2024).

Parameter | KOMPSAT-3 | KOMPSAT-3A
Launch date | May 18, 2012 | March 25, 2015
Ground sample distance | 0.70 m | 0.55 m
Altitude | 625 km | 528 km
Inclination | 98.13° | 97.513°
Orbital period | 98.5 minutes | 95.2 minutes
Number of revolutions | 14.6 revolutions/day | 15.1 revolutions/day
Repeated ground track | 28 days / 423 revolutions | 28 days / 409 revolutions


In this study, the AI-Hub Satellite Image Building Boundary Detection Dataset version 1.0, which is composed of KOMPSAT-3/3A images, was used. The dataset comprises images acquired over four cities (Los Angeles in the USA, Shanghai in China, Wolfsburg in Germany, and New Cairo in Egypt) and captures a variety of building forms by country and continent in both raster and polygon formats. It provides building coordinates both in the longitude-latitude coordinate system and in image pixel coordinates. The type of each building is expressed by color in the label image; however, this information was not utilized in this study. The dataset features over 150,000 buildings in 1,238 tiles of officially divided training data and 50,000 buildings in 159 tiles of validation data. Each tile is 1,024 × 1,024 pixels at a 0.55–0.70 m resolution. Each tile was used whole, without cropping, since SAM's standard input size is 1,024 × 1,024. We used the training data for training and validation, and the validation data for testing. Fig. 4 shows an example of an original image and its label image from the dataset.

Figure 4. Sample image of the dataset. (a) shows the original train image, and (b) shows the corresponding label image. The label image is expressed in five colors for each type of building; however, in this study, we do not utilize the types of buildings.

3.2. Hyperparameter Selection

To identify the optimal hyperparameter combination for SAM, we conducted 27 experimental cases for each scenario, with and without a learning rate scheduler. These experiments varied the optimizer type, learning rate, and batch size, as these hyperparameters directly influence the training dynamics. The remaining training conditions, such as the loss function and number of epochs, were kept constant, since they either define the static objective of the training or do not significantly affect the training process. Each optimizer was tested with three learning rates, 5 × 10⁻⁸, 1 × 10⁻⁷, and 2 × 10⁻⁷, and three batch sizes, 4, 8, and 12, resulting in nine combinations per optimizer. The maximum batch size was determined by the GPU used, an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory; a batch size of 12 was the largest for which the GPU could hold the whole model and the embeddings. Smaller batch sizes of 1 or 2 were avoided due to the risk of overfitting. The learning rate candidates were chosen empirically based on several experimental trials. The test cases are listed in Table 2. When the learning rate scheduler was used, the StepLR scheduler reduced the learning rate by half every 2 epochs. When applying the Adam optimizer, an L2 regularization weight decay of 0.01 was chosen to prevent overfitting by adding a penalty term to the loss function, and the same weight decay of 0.01 was applied to AdamW. The default settings were used for Lion.

Table 2. Test case numbers for combinations of learning rate and batch size.

Optimizers: Adam, AdamW, Lion | Scheduler: StepLR

Batch size | LR = 5 × 10⁻⁸ | LR = 1 × 10⁻⁷ | LR = 2 × 10⁻⁷
4 | Case 1 | Case 2 | Case 3
8 | Case 4 | Case 5 | Case 6
12 | Case 7 | Case 8 | Case 9

The case numbering applies identically to every optimizer tested.
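As a simple illustration, the 27 cases per scheduler scenario could be enumerated as follows; the case numbering mirrors Table 2, and the optimizers are only named here, since the Lion implementation comes from a separate package.

    from itertools import product

    optimizers = ["Adam", "AdamW", "Lion"]
    learning_rates = [5e-8, 1e-7, 2e-7]
    batch_sizes = [4, 8, 12]

    # Case numbers 1-9 per optimizer, following Table 2
    # (rows are batch sizes, columns are learning rates).
    cases = {}
    for opt in optimizers:
        for case_no, (bs, lr) in enumerate(product(batch_sizes, learning_rates), start=1):
            cases[(opt, case_no)] = {"batch_size": bs, "learning_rate": lr}

    # Example: AdamW Case 6 corresponds to batch size 8 and learning rate 2e-7.
    print(cases[("AdamW", 6)])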



Other hyperparameters, such as the loss function and number of epochs, were fixed across experiments in order to focus the analysis on the dynamic hyperparameters that influence the training process. Binary Cross Entropy (BCE) with logits loss was used to compute the loss and update the parameters in the training stage. It combines a sigmoid layer and the BCE loss in a single operation; BCE loss is the most commonly used loss function for semantic segmentation, and the combined form computes the loss with better numerical stability than applying BCE to sigmoid outputs, owing to the log-sum-exp trick. The predicted probability of each pixel is produced by the sigmoid layer and lies between 0 and 1, and a threshold of 0.5 was used to determine whether each pixel belongs to a building. The loss over a minibatch can be defined as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right]$$

where $y_i$ and $p_i$ denote the label and the predicted probability of pixel $i$, respectively, and $N$ is the number of pixels in the minibatch.
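A minimal PyTorch sketch of this loss and the 0.5 threshold is given below; the logits and label tensors are random placeholders standing in for the mask decoder output and the single-building ground truth.

    import torch
    import torch.nn.functional as F

    # Placeholders: raw mask logits from the decoder and a binary single-building label.
    logits = torch.randn(1, 1, 256, 256)
    label = torch.randint(0, 2, (1, 1, 256, 256)).float()

    # BCE with logits fuses the sigmoid and the BCE loss for numerical stability.
    loss = F.binary_cross_entropy_with_logits(logits, label)

    # For evaluation, probabilities are thresholded at 0.5 to obtain the binary mask.
    pred_mask = torch.sigmoid(logits) > 0.5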

Using the Case 9 setting with each optimizer, we trained the SAM model for 100 epochs to observe the trend in the loss value. As shown in Fig. 5, the training loss reaches its minimum before epoch 10 and does not decrease further; in some instances, after reaching this minimum, the loss remains static or increases slightly. Based on this observation, 10 epochs are sufficient, and increasing the number of epochs does not improve model performance. Therefore, 10 epochs were used consistently throughout all experiments.

Figure 5. Prior test with 100 epochs. Case 9 setting was used for all optimizer types. We can observe that the loss value converges in an early epoch in every test case.

3.3. Implementation Details

As mentioned earlier, SAM requires a single specific object to be segmented, so each SAM output contains one object mask for the whole image. To match this format, we created a single-object label image for each building from the ground-truth data and used it to measure the loss. Fig. 6 shows an example of the label images preprocessed separately for every building segment. Furthermore, an erosion kernel was applied because the white boundary in the original label image appears larger than the actual building boundary (see Fig. 6), which would otherwise cause overlaps between neighboring building masks.

Figure 6. Visualization of the label data preprocessing for fine-tuning. (a) and (b) display the original satellite image and its corresponding label image. (c) illustrates the preprocessed individual segment label images, where the example image contains 259 buildings, resulting in 259 individual label images prepared for fine-tuning. (d) shows the merged label image, combining all individual labels into a single image. The yellow boxes highlight the magnified portions, offering a closer view of the overlap between labels.

3.4. Evaluation Metrics

F1 Score, Intersection over Union (IoU), and mean IoU (mIoU) were used to measure the performance of the trained models along with Precision, Recall, and Accuracy. These metrics are widely used in semantic segmentation tasks, and they are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1\;Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) denote the numbers of pixels classified as true positive, true negative, false positive, and false negative, respectively; building and background pixels are treated as positive and negative, respectively. IoU was measured on the whole-patch prediction, in which every building mask in the image was merged, whereas mIoU was computed from each building segment prediction and then averaged. We defined the best hyperparameter set for SAM by comparing these metrics.
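For reference, the sketch below shows one straightforward way these pixel-wise metrics could be computed from a predicted and a ground-truth binary mask; the function is illustrative and assumes both masks contain building pixels so that no denominator is zero.

    import numpy as np

    def segmentation_metrics(pred, label):
        """Compute pixel-wise metrics from two binary masks (building = True)."""
        pred, label = pred.astype(bool), label.astype(bool)
        tp = np.logical_and(pred, label).sum()
        fp = np.logical_and(pred, ~label).sum()
        fn = np.logical_and(~pred, label).sum()
        tn = np.logical_and(~pred, ~label).sum()
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return {
            "Precision": precision,
            "Recall": recall,
            "Accuracy": (tp + tn) / (tp + tn + fp + fn),
            "F1": 2 * precision * recall / (precision + recall),
            "IoU": tp / (tp + fp + fn),
        }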

4. Results and Discussion

4.1. Training Loss Stability

To examine the training stability and convergence speed, we compared the loss value during training. Fig. 7 shows the mean epoch loss value of all hyperparameter setting cases for each optimizer type used. The epoch numbers are denoted from 0 to 9, and for each case, the mean epoch loss is represented by a solid line when the learning rate scheduler is not used, and a dotted line when the scheduler is applied. The same case setting is represented by the same color.

Figure 7. The mean epoch loss value for each training case is shown for the Adam, AdamW, and Lion optimizers, respectively. Cases that do not use the learning rate scheduler are represented with solid lines, while those that do use the scheduler are shown with dotted lines. The same set of cases is represented by the same color. When using AdamW, the training loss converges to a lower value and the process is stable. However, for the Adam and Lion optimizers, the training loss diverges to higher values in most cases without a scheduler, and even in some cases with a scheduler, indicating that the training process was not stable when using Adam and Lion.

Using Adam or Lion as the optimizer without the learning rate scheduler caused a significantly unstable training process, leading to loss values even higher than the initial state. Cases with AdamW did not show this problem, and the training loss converged to a lower value in all cases. The final convergence value also differed: with AdamW, all cases converged to values between 0.0004 and 0.0007, lower than any case with the other optimizers, whereas the minimum loss for Adam and Lion was 0.0011 in both, reached in Case 7 without the learning rate scheduler.

The learning rate scheduler had a crucial effect on training stability, especially for Adam and Lion. Without a scheduler, the loss diverged more steeply, and Cases 2, 3, 6, and 9 exhibited this tendency most strongly; the common factor among these cases is a larger learning rate, and the effect was more pronounced when combined with a smaller batch size. Still, all cases initially succeeded in decreasing the loss and reached a minimum of around 0.0012. Even when the scheduler was used, however, some cases failed to converge to this minimum value. Therefore, model performance can vary regardless of whether the training loss converges.

When using the AdamW optimizer, the training process is largely unaffected by changes in learning rate and batch size. The only difference observed was in the convergence speed; with a larger learning rate and smaller batch size, the number of iterations increased, leading to faster convergence. Furthermore, when a scheduler was used, the difference in trend diminished. Adam and Lion optimizer cases showed a similar trend with changes in learning rate and batch size. Larger batch size and smaller learning rate helped training to be more stable, while the opposite case—smaller batch size and larger learning rate—tended to introduce more variability and instability during training. It was observed that the batch size had a greater impact, likely because the number of iterations is determined by the batch size.

4.2. Mask Prediction Analysis

To analyze the performance of the trained models, we tested both the model from the final epoch and the model with the lowest mean epoch loss on the test dataset. Both quantitative and qualitative results were evaluated to determine the best hyperparameter combination.

4.2.1. Quantitative Results

Quantitative comparisons were conducted, as shown in Fig. 8. For each optimizer, the metric scores of the best-performing cases are presented in Table 3, with the highest scores highlighted in bold. From Fig. 8, it can be observed that all trained cases show better F1 Scores than the original SAM, marked with a red dotted line.

Figure 8. The first row shows the results with a scheduler, while the second row shows the results without one. From left to right, the bar graphs display the F1 Score, IoU, and mIoU metrics. The red dotted line indicates the original SAM model’s metric. Regardless of scheduler use, AdamW consistently outperformed the original SAM model, yielding better results. Additionally, when using a scheduler with AdamW, the performance metrics were more consistent.

Table 3. Quantitative metric results, with the largest value in each column in bold.

Optimizer | Scheduler | Case no. | F1 Score | IoU | mIoU | Precision | Recall | Accuracy
Original SAM | - | - | 0.72763 | 0.58943 | 0.65720 | 0.60019 | 0.97398 | 0.89239
Adam | O | 9 | 0.75148 | 0.62326 | 0.74637 | 0.64869 | 0.93911 | 0.90998
Adam | X | 8 | 0.75644 | 0.62366 | 0.74782 | 0.64801 | 0.94688 | 0.91008
AdamW | O | 6 | 0.77668 | 0.65535 | 0.86478 | 0.66284 | 0.97814 | 0.91588
AdamW | X | 6 | 0.77835 | 0.65148 | 0.84112 | 0.65961 | 0.98136 | 0.91581
Lion | O | 9 | 0.75164 | 0.62356 | 0.74719 | 0.64842 | 0.94005 | 0.91008
Lion | X | 8 | 0.75630 | 0.62345 | 0.74776 | 0.64789 | 0.94674 | 0.90998


All the best metrics were achieved by cases trained with AdamW. With AdamW, the trained model showed an F1 Score approximately 0.05 higher than the original SAM, regardless of scheduler use. When a scheduler was used, however, the metric values were very closely clustered, which can be interpreted as more consistent training. For Adam and Lion, the increase in the F1 Score depended on the use of a scheduler: approximately 0.024 with a scheduler and about 0.029 without one.

The influence of batch size was consistent across all cases, with larger batch sizes resulting in better performance. However, the effect of the learning rate differed between AdamW and the other two optimizers: for AdamW, larger learning rates led to better performance, whereas for Adam and Lion, smaller learning rates produced better results. The best F1 Score for Adam was 0.7564 in Case 8 and the worst was 0.74016 in Case 3, both without a learning rate scheduler; similarly, for Lion, the best F1 Score was 0.7563 in Case 8 and the worst was 0.7438 in Case 3, also without a scheduler.

4.2.2. Qualitative Results

The original SAM model generated masks that were larger than the actual buildings, resulting in overlapping areas between masks. This creates ambiguous boundary areas between segments, so the exact boundary of each building cannot be determined. The trained models exhibited this characteristic to a lesser extent than the original SAM, allowing them to better distinguish nearby building segments. Fig. 9 shows the mask generation results from all 27 models trained using the learning rate scheduler, and Fig. 10 shows selected mask predictions from the 159 validation images. The models generated masks that were more distinct between objects in the order of Adam, Lion, and AdamW, which is consistent with the higher recall metric observed for AdamW.

Figure 9. Mask results from all cases using the learning rate scheduler are displayed. The first column shows the original image, label image, and mask predicted by the original SAM model. The mask results from each trained model are arranged in a 3 x 3 grid for each optimizer type.
Figure 10. Five example images of the mask prediction results are shown. The first column displays the original test image from the dataset, while the second column presents the preprocessed label image. The third column shows the mask result from the original SAM model. The remaining columns depict the prediction results from the best-trained model for each optimizer type. In all cases, each building segment was predicted separately and then merged into a single image. The model trained with the AdamW optimizer tends to predict a more accurate form of the building mask. However, models trained with Adam or Lion were able to predict masks for more densely placed buildings.

As shown in Table 3, Adam and Lion both achieved their highest performance in Case 9 with a scheduler and in Case 8 without one, whereas AdamW performed best in Case 6. However, for AdamW, the variation between cases was minimal, suggesting that selecting a specific case as the best based on the evaluation metrics carries little significance.

For Adam and Lion, the impact of the learning rate and batch size was more pronounced. As the learning rate increased, prediction stability decreased, particularly affecting boundary expression and degrading performance: boundaries became wiggly, and grid-like dots appeared inside the masks. This phenomenon can be attributed to the original SAM's tendency to predict larger objects, a key difference between natural and satellite images. During training, SAM predicts uniquely shaped building segments over much larger areas, resulting in high losses and large parameter updates when predicting smaller buildings.

As a result, the model degraded significantly and failed to predict smaller buildings, creating grid-patterned gaps in the masks. Additionally, high learning rates caused unstable boundary shapes, producing overly complex masks. Therefore, a lower learning rate appears preferable for fine-tuning SAM, as it helps mitigate these issues: with a lower learning rate, parameter updates remain small even when a mask prediction fails badly, preventing the model from deteriorating, although training takes longer to converge. With a higher learning rate, trained models performed better in predicting uniquely shaped buildings but failed to predict larger, non-uniquely shaped buildings. As a result, a lower learning rate of 5 × 10⁻⁸ yielded better results.

Cases with larger batch sizes exhibited more accurate mask predictions than those with smaller batch sizes. When predicting images with densely placed buildings, cases with smaller batch sizes tended to predict larger masks, causing building masks to collide and appear as a single structure, similar to the original SAM. Training with a larger batch size partially mitigated this issue; however, models trained with AdamW were not able to improve on this problem.

In the predictions obtained from the original model and all trained models, patterned gaps appeared within the building masks. Examples of these inaccuracies are shown in Fig. 11. They cause not only low metric scores but also complex building boundaries. The degree of this effect varied with the hyperparameter combination used: with AdamW, the occurrence of patterns was less frequent, whereas for Adam and Lion, a significant increase in both the frequency and area of pattern occurrences was observed in specific cases. Cases 2, 3, and 6 had a higher rate of such errors, with Case 3 in particular showing that the mask generation process itself did not proceed correctly.

Figure 11. Example of patterned inaccuracies appearing within the masks from (a) original model, (b) Adam Case 2, and (c) Lion Case 2, respectively. (c) shows the unexpected additional pattern outside the building area.

4.3. Effect of Image Scale

We tested three cases of simple image upscaling and three cases of kinetic cropping with different buffer sizes. The evaluation metric scores are shown in Figs. 12 and 13, and examples of the mask predictions are shown in Figs. 14 and 15.

Figure 12. F1 Score and IoU metrics measured when simple upscaling is applied are shown. In all cases, 2x upscaling showed the highest performance, with the models ranking in order of AdamW, Lion, Adam, and the original model. When upscaling was applied, the original model showed the largest performance improvement.
Figure 13. F1 Score and IoU metrics measured when kinetic scaling was applied are shown. Each object was cropped with a selected buffer area. For the trained model, the AdamW optimizer with Case 6 was used for mask prediction. Metrics without scaling are indicated by different colored dotted lines for each model. When upscaling was applied, the original model showed a larger improvement in performance.
Figure 14. The images below show the masks predicted by each model when using a grid-cut, simply upscaled image as input. From the first row, the results are shown for the original, 2x, 4x, and 8x upscaled images, respectively. From left to right, the results were predicted using the best-performing models trained with the original SAM, Adam, AdamW, and Lion optimizers. It was observed that occasional errors occurred in the regions where the predictions were interrupted by grid splits. When upscaled by 2x, the results showed that even in densely packed building areas, individual objects could be clearly distinguished.
Figure 15. Mask prediction from three example images when cropping and buffering are applied. In the first row, the original image and label image are displayed. The second row shows the results of mask prediction without any preprocessing of the image. Starting from the third row, the results are shown in sequence for mask predictions where the image was cropped with buffers of 1 pixel, 50 pixels, and 100 pixels, respectively, before being used as input. In the odd-numbered columns, the results are from the original SAM model, while the even-numbered columns show the results from the model trained using the AdamW and Case 6 settings. When a narrow buffer was applied, it was observed that the original model struggled to make accurate predictions. In contrast, the trained model was still able to make predictions even in this case, and with a 100-pixel buffer, it was capable of generating highly accurate masks.

To examine the relationship between SAM's performance and object size, we adopted an approach similar to that of Ren et al. (2024). We first evaluated the segmentation performance on images upscaled by factors of 2, 4, and 8. All trained models were evaluated using the upscaled test images. Among the different optimizer types, Case 6 demonstrated the highest performance with AdamW, while Case 9 achieved the best performance with either Adam or Lion. The comparison was made with the models that achieved the highest performance for each optimizer. When the images were upscaled by factors of 2 and 4, both metrics showed improved performance compared to the original scale across all three trained models.

However, when upscaled by a factor of 8, performance noticeably decreased. For the original model, the F1 Score improved to 0.7641 from 0.7276 (an increase of 0.0365) when the image was upscaled by a factor of 2, while the IoU increased significantly to 0.6393 from 0.5894 (an increase of 0.0499). Similarly, for the model trained using AdamW, the F1 Score increased to 0.7909 from 0.7767, and the IoU rose to 0.6751 from 0.6553, outperforming all previously trained models. Across all optimizers, the best performance was observed when the image scale was between 2x and 4x.

As the image scale increases, the model tends to misclassify non-building areas. Additionally, as shown in Fig. 14, the prediction performance along the boundaries deteriorated during the process of reassembling the images that had been divided into grids. When upscaling was applied, the performance improvement of the trained model was more limited than that of the original model.

Next, we evaluated the original model and the model trained with AdamW under the Case 6 setting while varying the buffer size, which determines the extent of the area included around each building. Three buffer sizes, 1 pixel, 50 pixels, and 100 pixels, were assessed. Both the original model and the AdamW-trained model showed progressively better performance as the buffer around the objects increased from 1 to 50 to 100 pixels. However, similar to the simple upscaling results, as the buffer size increased, the performance gap between the trained model and the original model decreased. With kinetic cropping, the F1 Score of the AdamW-trained Case 6 model increased from 0.7515 to 0.8694, an improvement of 0.1179. For the original SAM model, the F1 Score improved from 0.7276 to 0.8499, and the IoU increased from 0.5894 to 0.7438.

Visual comparisons also demonstrated an improvement in mask generation performance. When predicting areas with densely located buildings, it was observed that using the full image led to more accurate extraction of building boundaries, enabling the mask outputs to be generated without overlap. However, the patterned inaccuracies tended to appear more often in the trained models. Figs. 14 and 15 illustrate the prediction masks generated with the various upscaling methods applied.

5. Conclusions

In this study, we optimized the application of SAM, specializing it for satellite images, through two primary approaches, given that SAM is predominantly pretrained on natural images. First, optimizing the hyperparameters allows SAM to improve segmentation quality on satellite images, specifically those from the KOMPSAT dataset in this study. When training SAM, the problem of occasional predictions containing patterned gaps should be considered, and it is crucial to develop strategies that minimize this phenomenon while enhancing building segmentation performance. Second, we applied SAM to images that were cropped object-by-object and buffered to further enhance performance.

Initially, we conducted a comparative analysis of the loss variation during training and the predictive performance of the trained models for different hyperparameter combinations. Performance improved most significantly when using the AdamW optimizer: the F1 Score increased from 0.7276 to 0.7767, a rise of 0.0491, and the IoU improved from 0.5894 to 0.6554, a gain of 0.066. With AdamW, there were no significant differences related to learning rate and batch size; however, even with a large batch size and high learning rate, stable and fast training was achieved when a learning rate scheduler was applied. Additionally, it was confirmed that, under the proposed hyperparameter settings, training could be conducted without increasing the occurrence of patterned inaccuracies.

To further examine the impact of image scaling, we compared performance when images were upscaled by factors of 2, 4, and 8, as well as when kinetic cropping with added buffers was applied to individual objects. Applying kinetic cropping with a 100-pixel buffer to the model trained with AdamW led to a significant increase in the F1 Score, which rose by 0.1179 to 0.8694 compared with prediction on the unpreprocessed images (0.7515). Visual comparison confirmed that buildings in densely populated areas could be extracted without overlapping masks.

Despite these advancements, this study has certain limitations. There are still instances where SAM produces inaccuracies within the mask during mask generation; this issue could be addressed through the application of HQ-SAM (Ke et al., 2023). Additionally, the loss value decreased rapidly within a single epoch; adjusting the dataset size may allow the loss to decrease continuously or remain stable over more epochs. Moreover, removing items from the dataset that significantly reduce the learning effect is likely to enhance the effectiveness of training. Further research could explore improving the learning effect through boundary-aware learning during training, considering the orientation of building segments and adjusting the image orientation accordingly.

Acknowledgments

This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00155763), and Korea Ministry of Land, Infrastructure and Transport (MOLIT) as 「Innovative Talent Education Program for Smart City」. The Institute of Engineering Research at Seoul National University provided research facilities for this work. This research (paper) used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through ‘AI-Hub (www.aihub.or.kr)’.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.


References

  1. Acharya, T. D., Yang, I. T., and Lee, D. H., 2016. Land cover classification using a KOMPSAT-3A multi-spectral satellite image. Applied Sciences, 6(11), 371. https://doi.org/10.3390/app6110371
  2. Badrinarayanan, V., Handa, A., and Cipolla, R., 2015. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293. https://doi.org/10.48550/arXiv.1505.07293
  3. Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., and Zou, Z., et al, 2024a. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing, 62, 4701117. https://doi.org/10.1109/TGRS.2024.3356074
  4. Chen, X., Liang, C., Huang, D., Real, E., Wang, K., and Pham, H., et al, 2024b. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675. https://doi.org/10.48550/arXiv.2302.06675
  5. Committee on Earth Observation Satellites, 2024. CEOS EO handbook - Mission summary. Available online: https://database.eohandbook.com/database/missionsummary.aspx?missionID=698&utm_source=eoportal&utm_content=kompsat-3a (accessed on July 10, 2024)
  6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., and Unterthiner, T., et al, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929
  7. Guo, Y., Liu, Y., Georgiou, T., and Lew, M. S., 2018. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7, 87-93. https://doi.org/10.1007/s13735-017-0141-z
  8. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., and Downey, D., et al, 2020. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. https://doi.org/10.48550/arXiv.2004.10964
  9. Huang, Z., Jing, H., Liu, Y., Yang, X., Wang, Z., and Liu, X., et al, 2024. Segment anything model combined with multi-scale segmentation for extracting complex cultivated land parcels in high-resolution remote sensing images. Remote Sensing, 16(18), 3489. https://doi.org/10.3390/rs16183489
  10. Ke, L., Ye, M., Danelljan, M., Tai, Y. W., Tang, C. K., and Yu, F., 2023. Segment anything in high quality. arXiv preprint arXiv:2306.01567. https://doi.org/10.48550/arXiv.2306.01567
  11. Khatua, A., Bhattacharya, A., Goswami, A. K., and Aithal, B. H., 2024. Developing approaches in building classification and extraction with synergy of YOLOV8 and SAM models. Spatial Information Research, 32, 511-530. https://doi.org/10.1007/s41324-024-00574-0
  12. Kim, S., and Kim, T., 2023. Automated extraction of orthorectified building layer from high-resolution satellite images. Korean Journal of Remote Sensing, 39(3), 339-353. https://doi.org/10.7780/kjrs.2023.39.3.7
  13. Kingma, D. P., and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://doi.org/10.48550/arXiv.1412.6980
  14. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., and Gustafson, L., et al, 2023. Segment anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, Oct. 1-6, pp. 4015-4026. https://doi.org/10.1109/iccv51070.2023.00371
  15. Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. https://doi.org/10.1145/3065386
  16. Long, J., Shelhamer, E., and Darrell, T., 2016. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1605.06211. https://doi.org/10.48550/arXiv.1605.06211
  17. Loshchilov, I., and Hutter, F., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. https://doi.org/10.48550/arXiv.1711.05101
  18. Mazurowski, M. A., Dong, H., Gu, H., Yang, J., Konz, N., and Zhang, Y., 2023. Segment anything model for medical image analysis: An experimental study. Medical Image Analysis, 89, 102918. https://doi.org/10.1016/j.media.2023.102918
  19. Ren, S., Luzi, F., Lahrichi, S., Kassaw, K., Collins, L. M., and Bradbury, K., et al, 2024. Segment anything, from space?. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, Jan. 3-8, pp. 8355-8365. https://doi.org/10.1109/wacv57701.2024.00817
  20. Ronneberger, O., Fischer, P., and Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A., (eds.), Medical image computing and computer-assisted intervention - MICCAI 2015, Springer, pp. 234-241. https://doi.org/10.1007/978-3-319-24574-4_28
  21. Sun, K., Xiao, B., Liu, D., and Wang, J., 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 15-20, pp. 5693-5703. https://doi.org/10.1109/cvpr.2019.00584
  22. Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., and Singhal, U., et al, 2020. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. https://doi.org/10.48550/arXiv.2006.10739
  23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., and Gomez, A. N., et al, 2017. Attention is all you need. arXiv preprint arXiv:1706.03762. https://doi.org/10.48550/arXiv.1706.03762
  24. Verde, N., Mallinis, G., Tsakiri-Strati, M., Georgiadis, C., and Patias, P., 2018. Assessment of radiometric resolution impact on remote sensing data classification accuracy. Remote Sensing, 10(8), 1267. https://doi.org/10.3390/rs10081267
  25. Wang, L., Fang, S., Meng, X., and Li, R., 2022a. Building extraction with vision transformer. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11. https://doi.org/10.1109/TGRS.2022.3186634
  26. Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., and Meng, X., et al, 2022b. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 196-214. https://doi.org/10.1016/j.isprsjprs.2022.06.008
  27. Xu, Y., Wu, L., Xie, Z., and Chen, Z., 2018. Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sensing, 10(1), 144. https://doi.org/10.3390/rs10010144
  28. Zheng, D., Li, S., Fang, F., Zhang, J., Feng, Y., and Wan, B., et al, 2023. Utilizing bounding box annotations for weakly supervised building extraction from remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-17. https://doi.org/10.1109/tgrs.2023.3271986
  29. Zhou, X., Liang, F., Chen, L., Liu, H., Song, Q., and Vivone, G., et al, 2024. MeSAM: Multiscale enhanced segment anything model for optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62, 5623515. https://doi.org/10.1109/tgrs.2024.3398038