Original Article

Medical Image Retrieval Based on Ensemble Learning using Convolutional Neural Networks and Vision Transformers

Authors

Abstract

The rapid growth of medical image repositories has made managing and retrieving medical visual data increasingly difficult, motivating Content-Based Image Retrieval (CBIR) as a means of facilitating the investigation of such medical imagery. One of the most serious challenges, and one that requires special attention, is the representational quality of the embeddings generated by retrieval pipelines: these embeddings should capture both global and local features so that more useful information is extracted from the input data. To fill this gap, in this paper we propose a CBIR framework that leverages deep neural networks to efficiently classify medical images and fetch those most relevant to a query image. Our proposed model combines Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) and learns to capture both the local and global structure of high-level feature maps. The method is trained to encode the images in the database and outputs a ranked list of database images ordered from most to least similar to the query. To conduct our experiments, an intermodal dataset containing ten classes across five different modalities is used to train and assess the proposed framework. The results show an average classification accuracy of 95.32% and a mean average precision of 0.61. The proposed framework can therefore be very effective in retrieving multimodal medical images of different organs in the body.
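As a rough illustration of the retrieval pipeline described above, the sketch below fuses a CNN branch and a ViT branch into a single image descriptor and ranks database images by cosine similarity to the query. The backbone choices (ResNet-50, ViT-B/16), the fusion by concatenation, and all function names here are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of CNN+ViT ensemble retrieval (assumed design, not the authors' code).
import torch
import torch.nn.functional as F
from torchvision import models

cnn = models.resnet50()          # hypothetical CNN backbone
vit = models.vit_b_16()          # hypothetical ViT backbone
cnn.fc = torch.nn.Identity()     # strip classification heads so the networks
vit.heads = torch.nn.Identity()  # return feature embeddings instead of logits

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """Concatenate local (CNN) and global (ViT) features into one descriptor."""
    return torch.cat([cnn(images), vit(images)], dim=1)

def rank(query: torch.Tensor, db_embeddings: torch.Tensor) -> torch.Tensor:
    """Return database indices ordered from most to least similar to the query."""
    q = F.normalize(embed(query), dim=1)        # shape (1, D)
    db = F.normalize(db_embeddings, dim=1)      # shape (N, D), precomputed with embed()
    scores = (db @ q.T).squeeze(1)              # cosine similarity of each item to the query
    return torch.argsort(scores, descending=True)
```

In this sketch the database embeddings are computed once offline with embed() and only the query is encoded at retrieval time, which matches the usual CBIR setup of ranking a fixed gallery against each incoming query.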

Keywords