Review
Hum Brain Mapp. 2020 May;41(7):1859-1874.
doi: 10.1002/hbm.24917. Epub 2020 Jan 10.

Tractostorm: The what, why, and how of tractography dissection reproducibility

Francois Rheault et al. Hum Brain Mapp. 2020 May.

Abstract

Investigative studies of white matter (WM) brain structures using diffusion MRI (dMRI) tractography frequently require manual WM bundle segmentation, often called "virtual dissection." Human errors and personal decisions make these manual segmentations hard to reproduce, and this lack of reproducibility has not yet been quantified by the dMRI community. It is our opinion that if the field of dMRI tractography wants to be taken seriously as a widespread clinical tool, it is imperative to harmonize WM bundle segmentations and develop protocols intended for use in clinical settings. The EADC-ADNI Harmonized Hippocampal Protocol achieved such standardization through a series of steps that must be reproduced for every WM bundle. This article documents the problem. A specific bundle segmentation protocol was used to provide a real-life example, but the contribution of this article is to discuss the need for reproducibility and standardized protocols, as for any measurement tool. This study required the participation of 11 experts and 13 nonexperts in neuroanatomy and "virtual dissection" across various laboratories and hospitals. Intra-rater agreement (Dice score) was approximately 0.77, while inter-rater agreement was approximately 0.65. The protocol provided to participants was not necessarily optimal, but its design mimics, in essence, what will be required in future protocols. Reporting tractometry results such as average fractional anisotropy, volume, or streamline count of a particular bundle without a sufficient reproducibility score could make analysis and interpretation more difficult. Coordinated efforts by the diffusion MRI tractography community are needed to quantify and account for the reproducibility of WM bundle extraction protocols in this era of open and collaborative science.

Keywords: bundle segmentation; diffusion MRI; inter-rater; intra-rater; reproducibility; tractography; white matter.

Figures

Figure 1
Illustration of the dissection plan of the PyT using the MI‐Brain software (Rheault, Houde, Goyette, Morency, & Descoteaux, 2016). Three axial inclusion ROIs (pink, green, yellow), one sagittal exclusion ROI (orange), two coronal exclusion ROIs (light yellow), and a cerebellum exclusion ROI (red, optional). The whole brain tractogram was segmented to obtain the left PyT. PyT, pyramidal tract; ROIs, regions of interest
Figure 2
Representation of the Dice coefficient (overlap) for both the streamline and the voxel representation. For the purpose of a didactic illustration, four streamlines are shown in a 2×5 “voxel grid”; the red and blue streamlines are identical. Each streamline is converted to a binary mask (point‐based for simplicity) shown in a compact representation. Voxels with points from three different streamlines will result in voxels with three different colors; this can be seen as a spatial smoothing. The matrices on the right show values for all pairs (symmetrical). The green and yellow streamlines are not identical, which results in a streamline‐wise Dice coefficient of zero. However, in the voxel representation they have three voxels in common and the result is (2 × 3) / (5 + 3) = 0.75
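The two Dice variants described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the streamline coordinates, the helper names (`dice`, `streamline_dice`, `voxel_dice`), and the choice of which streamline covers five voxels and which covers three are all assumptions made to reproduce the caption's arithmetic.

```python
def dice(a, b):
    """Dice coefficient 2|A n B| / (|A| + |B|) of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def streamline_dice(bundle_a, bundle_b):
    """Streamline-wise Dice: only strictly identical streamlines count as shared."""
    return dice(bundle_a, bundle_b)


def voxel_dice(bundle_a, bundle_b):
    """Voxel-wise Dice: compare the sets of voxels touched by each bundle."""
    voxels_a = {voxel for streamline in bundle_a for voxel in streamline}
    voxels_b = {voxel for streamline in bundle_b for voxel in streamline}
    return dice(voxels_a, voxels_b)


# Hypothetical streamlines, each a tuple of (row, column) voxel indices
# in a 2x5 grid (point-based mask, as in the caption).
red = ((0, 0), (0, 1), (0, 2), (0, 3), (0, 4))
blue = red  # identical to red
green = ((1, 0), (1, 1), (1, 2), (1, 3), (1, 4))  # assumed to cover 5 voxels
yellow = ((1, 1), (1, 2), (1, 3))                 # assumed to cover 3 voxels

print(streamline_dice([red], [blue]))      # identical streamlines
print(streamline_dice([green], [yellow]))  # no streamline in common
print(voxel_dice([green], [yellow]))       # (2 * 3) / (5 + 3)
```

Under these assumptions the green and yellow streamlines score 0 streamline-wise and 0.75 voxel-wise, which is the contrast the caption draws.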
Figure 3
Representation of the study design showing N participants, each of whom received five HCP datasets (listed and color coded), each replicated three times (original, flipped, translated). All participants had to perform the same dissection tasks on the same anonymized datasets. Intra‐rater, inter‐rater, and gold standard reproducibility were computed using the deanonymized datasets. More details are available in the Supporting Information
Figure 4
Comparison of bundles and the impact of spurious streamlines on the reproducibility measurements. Each block shows streamlines on the left and the voxel representation on the right (isosurface). Blocks 2a and 3a show the core (green/orange) and spurious (red/pink) portions of the bundle. Blocks 2b and 3b show only the core portion of the bundle. The table shows the reproducibility “score” between bundles; VOX marks voxel‐wise measures and STR marks streamline‐wise measures
Figure 5
Example of average segmentation, or gold standard, generation obtained from seven different segmentations. The first row shows the streamline representation and the second row shows the voxel representation as a smooth isosurface. From left to right, multiple voting ratios were used (1/7, 3/7, 5/7, 7/7), each time reducing the number of streamlines and voxels considered part of the average segmentation. A minimal vote set at one out of seven (left) is equivalent to a union of all segmentations, while a vote set at seven out of seven (right) is equivalent to an intersection of all segmentations
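The voting scheme in the Figure 5 caption can be illustrated with a short sketch. It assumes each segmentation is reduced to a set of element identifiers (streamline IDs or voxel indices); the function name `average_segmentation` and the seven toy segmentations are hypothetical and not taken from the study.

```python
from collections import Counter


def average_segmentation(segmentations, min_votes):
    """Keep every element selected by at least `min_votes` segmentations."""
    # Count each element at most once per segmentation.
    votes = Counter(element for seg in segmentations for element in set(seg))
    return {element for element, count in votes.items() if count >= min_votes}


# Seven hypothetical segmentations of the same bundle; integers stand in
# for streamline IDs or voxel indices.
segmentations = [
    {1, 2, 3, 4},
    {1, 2, 3},
    {1, 2, 3, 5},
    {1, 2, 4},
    {1, 2},
    {1, 3},
    {1, 6},
]

# Voting ratios 1/7, 3/7, 5/7, and 7/7 expressed as minimum vote counts.
for min_votes in (1, 3, 5, 7):
    kept = average_segmentation(segmentations, min_votes)
    print(min_votes, sorted(kept))
```

With a minimum of one vote the result is the union of all segmentations, with seven votes it is their intersection, and each stricter threshold can only shrink the average segmentation.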
Figure 6
Measurements (Q2; IQR) related to individual files for both groups. The average FA distributions for experts (0.49; 0.01) and nonexperts (0.47; 0.03) are not statistically different from each other. Similarly, the average length for experts (140.33 mm; 7.81 mm) and nonexperts (138.70 mm; 11.29 mm) cannot be distinguished. The streamline count of experts (2,893; 3,564*) has a significantly different distribution from that of nonexperts (9,383; 12,368*). The same can be said of the volume distribution, (34.00 cm3; 16.43 cm3*) for experts and (48.74 cm3; 24.57 cm3*) for nonexperts. The lower and upper fences for nonexperts are much wider, indicating more variation in results
Figure 7
Measurements (Q2; IQR) related to pairwise comparison measures for intra‐rater segmentations. The correlation of density maps showed no statistically significant difference between the expert (0.90; 0.17) and nonexpert (0.90; 0.17) groups. Distributions showed a statistically significant difference for both Dice scores. The Dice score of streamlines shows an easily observable difference between experts (0.10; 0.39*) and nonexperts (0.37; 0.46*). The difference between the distributions of the Dice score of voxels is less noticeable, at (0.75; 0.15*) for experts and (0.79; 0.14*) for nonexperts. The trend for intra‐rater reproducibility is that raters fail to select the same streamlines, but the ones that are selected still cover approximately the same volume. IQR: interquartile range
Figure 8
Measurements (Q2; IQR) related to pairwise comparison measures for inter‐rater segmentations. The correlation of density maps showed no statistically significant difference between the expert (0.82; 0.23*) and nonexpert (0.77; 0.29*) groups. As with the intra‐rater segmentations, distributions showed a statistically significant difference for both Dice scores. The Dice score of streamlines shows an easily observable difference between experts (0.11; 0.14*) and nonexperts (0.18; 0.32*), while the distributions of the Dice score of voxels for experts (0.63; 0.20*) and nonexperts (0.67; 0.18*) are more similar. Raters have difficulty selecting the same streamlines, but overall capture a similar volume. IQR: interquartile range
Figure 9
Measurements (Q2; IQR) related to pairwise comparison measures against the gold standard. The correlation of density maps, reaching (0.95; 0.04*) for experts and (0.88;1 5*) for nonexperts, is statistically different between the two groups. However, the Dice scores of streamlines are not statistically different, at (0.39; 0.18) and (0.34; 0.34), respectively. The Dice score of voxels is relatively high at (0.82; 0.05*) for experts and (0.76; 0.13*) for nonexperts. Despite variations between raters, overall the participants remain around the same average segmentation and obtain more agreement with the gold standard than with each other. IQR: interquartile range
Figure 10
Measurements (Q2; IQR) related to binary classification measures against the gold standard. The Kappa score is only significantly different for voxels (0.84; 0.06 and 0.80; 0.13) and not for streamlines (0.60; 0.16* and 0.65; 0.41*). There is a high degree of variability in the precision and sensitivity of streamlines (0.81; 0.19* and 0.50; 0.24* for experts) and (0.59; 0.37* and 0.82; 0.44* for nonexperts). These measures are more reliable with the voxel representation (0.92; 0.10* and 0.79; 0.17* for experts) and (0.78; 0.17* and 0.82; 0.44* for nonexperts). The streamline representation is always less reproducible than the voxel representation. Accuracy and specificity are not shown because both reach above 0.99 and do not provide useful visual insight. IQR: interquartile range
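The binary classification measures named in the Figure 10 caption can be sketched from a confusion matrix against the gold standard. This is an illustration under stated assumptions: "Kappa" is taken to mean Cohen's kappa, segmentations are reduced to sets of element identifiers, and the function name `binary_measures`, the `universe_size` parameter, and the toy numbers are hypothetical.

```python
def binary_measures(candidate, gold, universe_size):
    """Confusion-matrix measures of a segmentation against a gold standard.

    `universe_size` is the total number of elements that could have been
    selected (all streamlines in the tractogram, or all voxels in the image).
    """
    candidate, gold = set(candidate), set(gold)
    tp = len(candidate & gold)           # selected and in the gold standard
    fp = len(candidate - gold)           # selected but not in the gold standard
    fn = len(gold - candidate)           # missed gold-standard elements
    tn = universe_size - tp - fp - fn    # correctly left out

    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / universe_size

    # Cohen's kappa (assumed): observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / universe_size**2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 1.0

    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Toy example: a 100-element gold standard inside a 100,000-element tractogram.
gold = set(range(0, 100))
candidate = set(range(20, 110))  # 80 shared, 10 extra, 20 missed
measures = binary_measures(candidate, gold, universe_size=100_000)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Because a bundle is a small fraction of the whole tractogram or image, true negatives dominate the confusion matrix, so accuracy and specificity stay close to 1 even for an imperfect segmentation. That is consistent with the caption's remark that both exceed 0.99 and carry little visual insight.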
