Performance Evaluation of CNN and Vision Transformer for Deep-Learning-Based High-Resolution Image Classification

Authors

  • Humuntal Rumapea, Universitas Methodist Indonesia

DOI:

https://doi.org/10.46880/jmika.Vol9No2.pp372-379

Keywords:

Deep Learning, Convolutional Neural Network (CNN), Vision Transformer (ViT), Image Classification, High-Resolution Images

Abstract

This study evaluates and compares the performance of Convolutional Neural Networks (CNN) and Vision Transformers (ViT) in deep-learning-based high-resolution image classification. The dataset consists of high-resolution images that undergo preprocessing and data augmentation and are divided into training, validation, and testing sets. The CNN models ResNet50 and EfficientNet serve as baselines, while the Vision Transformer, which relies on a self-attention mechanism, serves as the comparative model. Performance is evaluated using accuracy, precision, recall, and F1-score, as well as training and inference time. The results indicate that the Vision Transformer achieves superior classification performance compared to the CNNs, with an accuracy of up to 93.85%. However, the CNNs demonstrate better computational efficiency, with lower training and inference times. Furthermore, increasing image resolution improves the performance of both model families, albeit at the cost of higher computational complexity, particularly for the Vision Transformer. This study highlights a trade-off between accuracy and efficiency, suggesting that model selection should be aligned with specific application requirements.
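The evaluation metrics named above (accuracy and macro-averaged precision, recall, and F1-score) can be sketched in plain Python. This is an illustrative implementation only; the function name and toy labels below are not taken from the paper, and in practice a library such as scikit-learn would typically be used.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging weights every class equally.
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with two classes (labels are hypothetical, not the study's data):
acc, prec, rec, f1 = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(f"acc={acc:.2f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

Accuracy alone can be misleading on imbalanced classes, which is why the study also reports precision, recall, and F1-score.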

References

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. http://arxiv.org/abs/2010.11929

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539

Li, Z., Hu, J., Wu, K., Miao, J., Zhao, Z., & Wu, J. (2024). Local feature acquisition and global context understanding network for very high-resolution land cover classification. Scientific Reports, 14(1), 12597. https://doi.org/10.1038/s41598-024-63363-7

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012–10022.

Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11976–11986.

Maurício, J., Domingues, I., & Bernardino, J. (2023). Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Applied Sciences, 13(9), 5521. https://doi.org/10.3390/app13095521

Rawat, W., & Wang, Z. (2017). Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation, 29(9), 2352–2449. https://doi.org/10.1162/neco_a_00990

Rodrigo, M., Cuevas, C., & García, N. (2024). Comprehensive comparison between vision transformers and convolutional neural networks for face recognition tasks. Scientific Reports, 14(1), 21392. https://doi.org/10.1038/s41598-024-72254-w

Takahashi, S., Sakaguchi, Y., Kouno, N., Takasawa, K., Ishizu, K., Akagi, Y., Aoyama, R., Teraya, N., Bolatkan, A., Shinkai, N., Machino, H., Kobayashi, K., Asada, K., Komatsu, M., Kaneko, S., Sugiyama, M., & Hamamoto, R. (2024). Comparison of Vision Transformers and Convolutional Neural Networks in Medical Image Analysis: A Systematic Review. Journal of Medical Systems, 48(1), 84. https://doi.org/10.1007/s10916-024-02105-8

Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 6105–6114.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning, 10347–10357.

Wang, Y., Deng, Y., Zheng, Y., Chattopadhyay, P., & Wang, L. (2025). Vision Transformers for Image Classification: A Comparative Survey. Technologies, 13(1), 32. https://doi.org/10.3390/technologies13010032

Woo, S., Park, J., Lee, J. Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), 13–19.

Published

2025-10-31

Section

METHOMIKA: Jurnal Manajemen Informatika & Komputerisasi Akuntansi