Sci Data. 2024 May 25;11(1):539.
doi: 10.1038/s41597-024-03359-0.

REAL-Colon: A dataset for developing real-world AI applications in colonoscopy

Carlo Biffi et al. Sci Data. 2024.

Abstract

Detection and diagnosis of colon polyps are key to preventing colorectal cancer. Recent evidence suggests that AI-based computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems can enhance endoscopists' performance and boost colonoscopy effectiveness. However, most available public datasets primarily consist of still images or video clips, often at a down-sampled resolution, and do not accurately represent real-world colonoscopy procedures. We introduce the REAL-Colon (Real-world multi-center Endoscopy Annotated video Library) dataset: a compilation of 2.7 M native video frames from sixty full-resolution, real-world colonoscopy recordings across multiple centers. The dataset contains 350k bounding-box annotations, each created under the supervision of expert gastroenterologists. Comprehensive patient clinical data, colonoscopy acquisition information, and polyp histopathological information are also included in each video. With its unprecedented size, quality, and heterogeneity, the REAL-Colon dataset is a unique resource for researchers and developers aiming to advance AI research in colonoscopy. Its openness and transparency facilitate rigorous and reproducible research, fostering the development and benchmarking of more accurate and reliable colonoscopy-related algorithms and models.

Conflict of interest statement

C.B., P.S., and A.C. are affiliated with Cosmo Intelligent Medical Devices, the developer of the GI Genius medical device. C.H. is a consultant for Medtronic and Fujifilm.

Figures

Fig. 1
Flowchart outlining the two-phase selection process for creating the REAL-Colon dataset from 368 video recordings across four distinct cohorts. Phase 1 applies a penalty scoring system based on video and histological criteria. In Phase 2, 15 videos per cohort are manually selected after ranking to ensure diversity and representation while maintaining the cohort average lesion count.
Fig. 2
Clinical Data Distribution. This figure presents histograms depicting the distribution of sex, age, polyp count per procedure, BBPS scores, endoscope brand, and procedure duration within the REAL-Colon dataset.
Fig. 3
Polyp Characteristics Distribution. The histograms in this figure highlight the distribution of the anatomical location, size (in millimeters), and histology of the polyps included in the REAL-Colon dataset.
Fig. 4
Left: Histogram displaying the number of bounding boxes per frame. Right: Distribution of the number of bounding boxes associated with each polyp.
Fig. 5
Left: Histogram displaying the number of tracklets per polyp, using a 1-second threshold to identify separate tracklets. The x-axis represents the number of tracklets associated with each polyp, while the y-axis shows the count of polyps with that number of tracklets. Right: Plot illustrating the decrease in the number of tracklets as a function of the disappearance threshold. Here, the x-axis signifies the disappearance threshold in seconds, which determines when a new tracklet is created once a polyp disappears for longer than the threshold duration. The y-axis reports the resulting number of tracklets.
Fig. 6
Boxplots contrasting actual polyp sizes with bounding box dimensions (left) and heatmaps depicting bounding box placements (right) during the early phase of appearance (≤1 s) and afterwards (>1 s). In the early frames, polyps are captured within small bounding boxes scattered across the colon. As time progresses, the endoscopist centers the polyps in the frame, leading to larger bounding boxes with more variable dimensions.
Fig. 7
Sample images from the testing dataset, with results from the best-performing model. White boxes are the ground truth annotations; blue ellipses are the model predictions. The first row shows examples of false negative polyps: (a) a small and distant polyp, (b) a polyp partially covered by water/bubbles, (c) a polyp framed in blue light, (d) a large, overexposed polyp near the image boundary. The second row shows examples of false positive detections: (e) the model activates on an artifact due to stain and motion blur, (f) the model activates on a solid residue, (g) the model activates on an area of the colonic mucosa that is not well inflated, (h) the model activates on a dark and distant area of the colonic mucosa whose shape is similar to a polyp.
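
The tracklet counting described in the Fig. 5 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `count_tracklets`, the input format (a list of frame indices in which one polyp has a bounding box), and the frame rate used in the example are all assumptions made for illustration. The only rule taken from the caption is that a new tracklet starts once the polyp has disappeared for longer than the disappearance threshold.

```python
from typing import List


def count_tracklets(frames: List[int], fps: float, threshold_s: float = 1.0) -> int:
    """Count tracklets for a single polyp (hypothetical helper).

    frames: frame indices in which the polyp has a bounding box.
    fps: video frame rate, used to convert frame gaps to seconds (assumed known).
    threshold_s: disappearance threshold in seconds.
    """
    if not frames:
        return 0
    ordered = sorted(frames)
    tracklets = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # Time the polyp was absent between two consecutive annotated frames.
        gap_s = (cur - prev - 1) / fps
        # A disappearance longer than the threshold opens a new tracklet.
        if gap_s > threshold_s:
            tracklets += 1
    return tracklets
```

For example, with annotated frames `[0, 1, 2, 40, 41, 200, 201]` at an assumed 30 frames per second, the polyp is absent for about 1.2 s and then about 5.3 s. A 1-second threshold gives 3 tracklets, a 2-second threshold gives 2, and a 10-second threshold gives 1, matching the decreasing trend the right-hand plot describes.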