Performance Evaluation of CNN and Vision Transformer for Deep Learning-Based High-Resolution Image Classification
DOI:
https://doi.org/10.46880/jmika.Vol9No2.pp372-379
Keywords:
Deep Learning, Convolutional Neural Network (CNN), Vision Transformer (ViT), Image Classification, High-Resolution Images
Abstract
This study aims to evaluate and compare the performance of Convolutional Neural Networks (CNN) and Vision Transformers (ViT) in deep learning-based high-resolution image classification. The dataset consists of high-resolution images that undergo preprocessing and data augmentation, and is divided into training, validation, and testing sets. The CNN models used include ResNet50 and EfficientNet as baselines, while the Vision Transformer, which relies on a self-attention mechanism, serves as the comparative model. Performance is evaluated using accuracy, precision, recall, and F1-score, as well as training and inference time. The results indicate that the Vision Transformer achieves superior classification performance compared to CNN, with an accuracy of up to 93.85%. However, CNN demonstrates better computational efficiency, with lower training and inference times. Furthermore, increasing image resolution improves the performance of both models, albeit at the cost of higher computational complexity, particularly for the Vision Transformer. This study highlights a trade-off between accuracy and efficiency, suggesting that model selection should be aligned with specific application requirements.
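The evaluation metrics named in the abstract (accuracy, precision, recall, F1-score) can be sketched in plain Python as follows. This is a minimal illustration of how such metrics are computed, using hypothetical labels rather than the study's data; the study itself does not publish this code.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging weights every class equally
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical 3-class example (illustrative only, not the study's dataset)
y_true = ["cat", "dog", "bird", "cat", "dog", "bird"]
y_pred = ["cat", "dog", "cat", "cat", "dog", "bird"]
metrics = classification_metrics(y_true, y_pred)
```

In practice a library routine such as scikit-learn's `precision_recall_fscore_support` would be used; the hand-rolled version above only makes the definitions explicit.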
License
Copyright (c) 2025 Humuntal Rumapea

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.