Abstract
This study evaluated the potential of a multimodal model for land cover classification. The performance of the Clipseg multimodal model was compared with that of two unimodal models, the Convolutional Neural Network (CNN)-based Unet and the Transformer-based Segformer. Using orthophotos of two areas (Area1 and Area2) in Wonju City, Gangwon Province, classification was performed for seven land cover categories (Forest, Cropland, Grassland, Wetland, Settlement, Bare Land, and Forestry-managed Land). The Clipseg model showed the strongest generalization to a new environment, achieving the highest accuracy among the three models in the test area (Area2) with an Overall Accuracy of 83.9% and a Kappa of 0.72. It performed particularly well in classifying Forest (F1-Score 94.7%), Cropland (78.0%), and Settlement (78.4%). Although Unet and Segformer achieved high accuracy in the training area (Area1), their accuracy dropped by 29% and 20%, respectively, in the test area, indicating limited generalization ability. The Clipseg model required the most parameters (approximately 150 million) and the longest training time (10 hours 48 minutes) but delivered stable performance in the new environment. In contrast, Segformer achieved considerable accuracy with the fewest parameters (about 16 million) and the shortest training time (3 hours 21 minutes), demonstrating its potential for resource-limited environments. These results indicate that image-text multimodal models hold strong potential for land cover classification, and their superior generalization to new environments suggests they can be applied effectively across diverse regions. Future research could further improve classification accuracy through refinements to the model architecture, mitigation of class imbalance, and additional validation in diverse environments.
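For reference, the Overall Accuracy, Kappa, and per-class F1-Score values reported above are standard pixel-level agreement measures computed from predicted and reference label maps. The snippet below is a minimal sketch of how such metrics can be obtained with scikit-learn; the function name, class ordering, and random placeholder labels are illustrative assumptions, not the study's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Assumed ordering of the seven land cover categories (illustrative only)
CLASSES = ["Forest", "Cropland", "Grassland", "Wetland",
           "Settlement", "Bare Land", "Forestry-managed Land"]

def evaluate(y_true, y_pred):
    """Compute Overall Accuracy, Cohen's Kappa, and per-class F1-Score
    from flattened per-pixel label arrays of shape (n_pixels,)."""
    oa = accuracy_score(y_true, y_pred)           # Overall Accuracy
    kappa = cohen_kappa_score(y_true, y_pred)     # Cohen's Kappa
    f1 = f1_score(y_true, y_pred, average=None,   # one F1 value per class
                  labels=list(range(len(CLASSES))))
    return oa, kappa, dict(zip(CLASSES, f1))

# Illustrative usage with random labels standing in for real model output
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(CLASSES), size=10_000)
y_pred = rng.integers(0, len(CLASSES), size=10_000)
oa, kappa, per_class_f1 = evaluate(y_true, y_pred)
print(f"OA={oa:.3f}, Kappa={kappa:.3f}", per_class_f1)
```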