2023 Oct 25;278:110907.
doi: 10.1016/j.knosys.2023.110907. Epub 2023 Aug 18.

Medical Image Retrieval via Nearest Neighbor Search on Pre-trained Image Features

Deepak Gupta et al. Knowl Based Syst.

Abstract

Nearest neighbor search (NNS) is a technique used to locate the points in a high-dimensional space that are closest to a given query point. This technique has multiple applications in medicine, such as searching large medical imaging databases, disease classification, and diagnosis. However, when the number of points is very large, the brute-force approach to finding the nearest neighbor becomes computationally infeasible, so various approaches have been developed to make the search faster and more efficient. With a focus on medical imaging, this paper proposes DenseLinkSearch (DLS), an effective and efficient algorithm that searches and retrieves relevant images from heterogeneous sources of medical images. Given a medical image database, the proposed algorithm builds an index consisting of pre-computed links for each point in the database; the search algorithm uses this index to traverse the database efficiently in search of the nearest neighbors. We also explore the role of medical image feature representation in content-based medical image retrieval and propose a Transformer-based feature representation technique that outperforms existing pre-trained Transformer-based approaches on benchmark medical image retrieval datasets. We extensively tested the proposed NNS approach against state-of-the-art NNS approaches on benchmark datasets and on our newly created medical image datasets. The proposed approach outperformed the existing approaches in terms of both the accuracy of the retrieved neighbors and retrieval speed: compared to existing approximate NNS approaches, DLS achieved a lower average time per query and ≥ 99% R@10 on 11 out of 13 benchmark datasets. We also found that the proposed medical feature representation approach represents medical images better than existing pre-trained image models: the proposed feature extraction strategy improved P@5, P@10, and P@20 by 9.37%, 7.0%, and 13.33%, respectively, over the best-performing pre-trained image model. The source code and datasets of our experiments are available at https://github.com/deepaknlp/DLS.
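The brute-force baseline that the abstract contrasts against can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the paper's code: the array sizes and random vectors are placeholders for real pre-trained image embeddings.

```python
# Brute-force nearest neighbor search over image feature vectors.
# O(N * d) distance computations per query -- the cost that becomes
# infeasible at scale and that indexed approaches like DLS avoid.
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((10_000, 512))  # 10k images, 512-d features
query = rng.standard_normal(512)

# Euclidean distance from the query to every database vector.
dists = np.linalg.norm(features - query, axis=1)

# Indices of the 10 nearest images, closest first.
top10 = np.argsort(dists)[:10]
```

With millions of vectors, this linear scan is what index-based methods trade away in exchange for a one-time index construction cost.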

Keywords: Content-based image retrieval; Image feature representation; Indexing; Nearest neighbor search; Searching in High Dimensions.


Conflict of interest statement

Declarations of interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1:
Illustration of the indexing algorithm. Subfigure (a) shows the image vector representations in high-dimensional space. The indexing algorithm treats each vector representation as a vertex and the distance between two vertices as an edge of the index graph. Indexing starts with a random vertex, which is then called a node; at any point during index construction, vector representations that have not yet become nodes of the index graph are drawn with a distinct marker. The algorithm initializes a heap data structure of size 𝓚index = 2 and a list for each vertex. The heap holds links to the nearest vectors known so far; the list holds links to other vectors that consider this vector near. The top link in a heap defines the neighborhood radius, also known as the near distance. Subfigure (b) shows a snapshot of the index after data point E became a node and its heap and list were updated based on the distances between the node and its neighbors; the pre-existing nodes and edges, and the recently processed node E with its added edges, are drawn with distinct markers. Subfigure (c) shows each node's heap and list items and the index graph with its nodes and edges after the index is built. The created links and distances are shown in subfigure (d). The final index contains descend and spread links for each node in the dataset: the descend links of a vertex store the endpoint vertices and distances held in its heap just before the vertex became a node during index construction, while the spread links store the endpoint vertices and distances held in its heap and list once the algorithm ends.
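The heap-and-list bookkeeping described in this caption can be sketched roughly as follows. This is a toy reading of the caption, not the released DLS implementation: `K_INDEX`, the random points, and the naive all-pairs insertion loop are our own simplifications, and stale reverse-list entries are left in place after eviction for brevity.

```python
# Toy index construction: each node keeps a bounded max-heap of its
# K_INDEX closest links and a list of nodes that consider it near.
import heapq
import numpy as np

K_INDEX = 2
rng = np.random.default_rng(1)
points = rng.standard_normal((50, 8))

heaps = [[] for _ in points]  # heaps[i]: (-distance, j); at most K_INDEX links
lists = [[] for _ in points]  # lists[i]: nodes that hold a link to i

for i in range(1, len(points)):      # insert vertices one by one as nodes
    for j in range(i):               # naive: compare against existing nodes
        d = float(np.linalg.norm(points[i] - points[j]))
        for a, b in ((i, j), (j, i)):  # try to link both endpoints
            if len(heaps[a]) < K_INDEX:
                heapq.heappush(heaps[a], (-d, b))
                lists[b].append(a)
            elif d < -heaps[a][0][0]:  # closer than the current near distance
                # Evict the farthest link; the heap root (most-negative
                # entry) always defines the neighborhood radius.
                heapq.heappushpop(heaps[a], (-d, b))
                lists[b].append(a)
```

Storing negated distances turns Python's min-heap into the max-heap the caption describes, so the top link is always the near distance.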
Figure 2:
Pseudocode of the proposed indexing algorithm.
Figure 3:
Demonstration of DenseLinkSearch, which uses the index built as shown in Fig. 1. DenseLinkSearch operates in Descend and Spread stages to find the 𝓚search nearest neighbors. The input to the search algorithm is a query vector and the pre-built index with its links. In subfigure (a), the search starts at a random node and computes the distance between the query and that node. In each step of the Descend stage, the algorithm keeps track of the vector closest to the query, and a heap 𝓗 of size 𝓚search is maintained to track the 𝓚search nearest neighbors throughout the search. Initially, the starting node is the nearest neighbor to the query, so it is pushed into the heap 𝓗. In subfigure (b), the distances between the query vector and the links of the current node are computed, and the closest vector is updated accordingly. Subfigure (c) shows that the closest node is chosen for the next step of the Descend stage. Since the distances (23 and 32) between the query and that node's links are greater than the distance (18) between the query and the current closest node, the search switches to the Spread stage. The Spread stage is demonstrated in subfigure (d), where the distances between the query and those links of the node whose distances have not already been computed are evaluated. At the end of the algorithm, the 𝓚search nearest neighbors are found in the heap 𝓗.
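The Descend/Spread traversal described above might look roughly like this. This is a simplified sketch based on our reading of the caption, not the released DLS code: `dense_link_search`, the link-graph layout, and the single Spread pass are our own assumptions.

```python
# Sketch of a Descend/Spread search over a pre-built link graph.
# links[i] lists the node indices that node i links to.
import heapq
import numpy as np

def dense_link_search(query, features, links, k_search, start=0):
    dist = lambda i: float(np.linalg.norm(features[i] - query))
    visited = {start}
    heap = [(-dist(start), start)]      # max-heap of the k_search best so far
    current = start
    while True:                         # Descend: greedily move closer
        best, best_d = current, dist(current)
        for nbr in links[current]:
            if nbr in visited:
                continue
            visited.add(nbr)
            d = dist(nbr)
            if len(heap) < k_search:
                heapq.heappush(heap, (-d, nbr))
            elif d < -heap[0][0]:
                heapq.heappushpop(heap, (-d, nbr))
            if d < best_d:
                best, best_d = nbr, d
        if best == current:             # no closer link found:
            break                       # switch to the Spread stage
        current = best
    for _, node in list(heap):          # Spread: expand the candidates'
        for nbr in links[node]:         # links not yet visited
            if nbr in visited:
                continue
            visited.add(nbr)
            d = dist(nbr)
            if len(heap) < k_search:
                heapq.heappush(heap, (-d, nbr))
            elif d < -heap[0][0]:
                heapq.heappushpop(heap, (-d, nbr))
    return sorted((-nd, n) for nd, n in heap)  # (distance, node), closest first
```

With a fully connected link graph this degenerates to exact search; the point of the sparse index built in Fig. 1 is to keep the number of distance computations far below brute force.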
Figure 4:
Kernel density estimate (KDE) plot to visualize the sample distribution of the (a) Artificial and (b) Faces datasets.
Figure 5:
The effect of 𝓚index and 𝓚search on ATPQ (a) and Recall@10 (b) on 1,000,000 samples of OpenI-ConvNeXt dataset.
Figure 6:
The effect of 𝓚index and 𝓚search on ATPQ (a) and Recall@10 (b) on 1,000,000 samples of the OpenI-ResNet dataset.
Figure 7:
Comparison of the proposed DLS approach with HNSW, using Pareto fronts, in terms of queries per second and R@10.
Figure 8:
Performance of the proposed DLS approach, with Pareto fronts, in terms of queries per second and R@10 on the SIFT and GIST datasets.
Figure 9:
Effect of the sample size on the average time per query for the OpenI-ResNet dataset.
Figure 10:
Effect of the sample size on the average time per query for the OpenI-ConvNeXt dataset.
Figure 11:
The effect of dataset size on average time per query and the number of distance calculations done by DenseLinkSearch to find the nearest neighbors.
Figure 12:
Comparison of the nearest images retrieved by the existing OpenI system (a) and our proposed system (b). The first image in each subfigure is the query image; the remaining images are the nearest neighbors in decreasing order of similarity to the query image.

