Review
J Magn Reson Imaging. 2016 Jan;43(1):28-44.
doi: 10.1002/jmri.24997. Epub 2015 Jul 14.

Recommendations for real-time speech MRI

Sajan Goud Lingala et al. J Magn Reson Imaging. 2016 Jan.

Abstract

Real-time magnetic resonance imaging (RT-MRI) is being increasingly used for speech and vocal production research studies. Several imaging protocols have emerged based on advances in RT-MRI acquisition, reconstruction, and audio-processing methods. This review summarizes the state of the art, discusses technical considerations, and provides specific guidance for new groups entering this field. We provide recommendations for performing RT-MRI of the upper airway. This is a consensus statement stemming from the ISMRM-endorsed Speech MRI summit held in Los Angeles, February 2014. A major unmet need identified at the summit was consensus on protocols that can be easily adapted by researchers equipped with conventional MRI systems. To this end, we provide a discussion of tradeoffs in RT-MRI in terms of acquisition requirements, a priori assumptions, artifacts, computational load, and performance for different speech tasks. We provide four recommended protocols and identify appropriate acquisition and reconstruction tools. We list pointers to open-source software that facilitate implementation. We conclude by discussing current open challenges in the methodological aspects of RT-MRI of speech.

Keywords: rapid imaging; real time MRI; recommendations; speech imaging.

Figures

FIGURE 1
Spatiotemporal resolution requirements for various speech tasks. The placement of these “zones” reflects the current consensus opinion among speech imaging researchers in attendance at the 2014 Speech MRI Summit. Boundaries are approximate due to the lack of gold-standard imaging techniques, and are being refined through the more widespread adoption of noninvasive techniques such as RT-MRI. The plots depict the spatiotemporal resolutions that could be realized by the four recommended protocols in Table 1. Note that each of the protocols has tradeoffs beyond spatial and temporal resolution, which require careful consideration and are discussed in detail in the text. Examples include SNR losses at high spatial resolutions (all protocols), off-resonance artifacts (Protocol 2), and long acquisition times (Protocol 4).
FIGURE 2
Comparison of image quality from different receiver coils in one adult subject. Images were acquired using Protocol 2, spiral gradient echo imaging; 2.4 × 2.4 mm² spatial resolution. Compared to a standard head coil (left), custom airway coils (middle, right) provide improved SNR in all vocal tract articulators including the lips, velum, epiglottis, and tongue. The SNR gain ranges from 50% to 1600%, and is greatest for articulators closest to the coil elements (the lips in this case). Custom array coils can also provide improved parallel imaging performance (not shown).
FIGURE 3
Comparison of GRE and SSFP image quality at 3T. Note that the SSFP images have higher SNR but are sensitive to off-resonance and its associated banding artifacts. In this example, the banding artifacts manifested as signal voids on the velum and portions of the tongue (see arrows).
FIGURE 4
Example images of velopharyngeal closure obtained at 1.5T using Protocol 1. (a,b) The placement of the shim volume (green box) centered around the velum (yellow arrow) for a 39-year-old male volunteer. Despite careful shimming, images can degrade substantially in some subjects during the speech sample acquisition, as demonstrated by images (c,d) in a 28-year-old patient with repaired cleft lip and palate. The velum (yellow arrow) almost completely disappears in the elevated position (d). Although h-EPI sequences have a lower SNR (e,f), it is possible to track the velum (yellow arrow) throughout the speech sample.
FIGURE 5
Simulated aliasing artifacts for spiral, radial, and Cartesian acquisitions. Simulations are based on 2.4 × 2.4 mm² resolution with a 20 × 20 cm² FOV of a mid-sagittal slice, and are realized by retrospective subsampling of a reconstructed image formed from a 34-interleave spiral dataset, using different trajectories. All the trajectories were simulated with realistic TRs that were obtained from a 1.5T GE scanner with modern gradient specifications (40 mT/m amplitude and 150 mT/m/ms slew rate). The left, middle, and right columns respectively correspond to Protocols 2, 3, and 1, all based on gradient echo imaging. In comparison to Cartesian aliases, spiral and radial aliasing artifacts are less coherent and less detrimental to visualization of the vocal tract air space and articulators.
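The coherent nature of Cartesian aliasing can be illustrated with a one-dimensional toy model. The sketch below is not the simulation used for the figure; it assumes a hypothetical 8-pixel object with a single bright pixel and shows that uniformly discarding every other k-space sample produces a sharp replica shifted by half the field of view.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for a toy example)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out

# Hypothetical 1D "object": one bright pixel at index 1 of an 8-pixel FOV.
obj = [0.0] * 8
obj[1] = 1.0

kspace = dft(obj)
# Retrospective uniform 2x subsampling: keep even k-space samples, zero the rest.
undersampled = [v if k % 2 == 0 else 0.0 for k, v in enumerate(kspace)]
# Magnitude image after inverse transform; energy splits between index 1 and index 5.
aliased = [abs(v) for v in dft(undersampled, inverse=True)]
```

The replica at index 5 (half the FOV away) is as sharp as the original pixel, which is what makes uniform Cartesian undersampling artifacts coherent. Non-Cartesian trajectories spread the same energy more diffusely, which this 1D sketch does not model.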
FIGURE 6
Effect of spiral readout duration on image quality. Each image represents a single frame captured when the subject sustained the nasal sound /n/. The spiral readout durations were (a) 2.520 msec, (b) 3.584 msec, (c) 4.576 msec, (d) 6.368 msec, and (e) 10.560 msec. Identical shim values were used for (a–e). Blurring artifacts are most prominent near air-tissue interfaces of the tongue, lips, and velum. (Figure courtesy: Y-C. Kim, Samsung Research, Seoul, South Korea.)
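To see why longer readouts blur more, recall that a spin that is off-resonance by Δf accumulates phase 2πΔf·t over a readout of duration t. The sketch below tabulates that phase, in cycles, for the five readout durations in the caption. The 100 Hz offset is a hypothetical value chosen for illustration; it is not taken from the paper.

```python
# Readout durations from the caption, in milliseconds.
readout_ms = {"a": 2.520, "b": 3.584, "c": 4.576, "d": 6.368, "e": 10.560}

# Assumption: an illustrative off-resonance of 100 Hz near an air-tissue interface.
OFF_RES_HZ = 100.0

def phase_cycles(delta_f_hz, duration_ms):
    """Phase accrued by the end of the readout, in cycles (phase / 2*pi)."""
    return delta_f_hz * duration_ms / 1000.0

cycles = {label: phase_cycles(OFF_RES_HZ, t) for label, t in readout_ms.items()}
```

Under this assumption the shortest readout accrues about a quarter of a cycle, while the longest accrues more than a full cycle, in line with the increasing blur from (a) to (e).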
FIGURE 7
Off-resonance blurring in spiral MRI can be mitigated by real-time adjustment of the center frequency. Note that air-tissue interfaces (yellow arrows) are the locations of largest variation in resonant frequency.
FIGURE 8
Redundancy in dynamic speech data: A reconstructed time series with spatial resolution of 2.4 mm² and temporal resolution of 12 msec/frame, obtained from an ~6.5-fold accelerated reconstruction, is considered for demonstration. (a) A representative spatial frame of the time series, and (b) the time profile along the arrow specified in (a). (c) The temporal finite difference of the time series, which is obtained by taking the pixel-wise finite difference of the dynamic data along time. The sparse representation of the temporal finite difference transform was exploited to realize the accelerated reconstruction in (a,b). The dynamic data also have sparse representations in the spatial-spectral domain (d); this representation is obtained by evaluating the Fourier transform of the data along the temporal dimension. The dynamic data can be rearranged as a Casorati matrix (each frame vectorized and stored as a column). A singular value decomposition of this matrix reveals the redundancies amongst the pixel time profiles, i.e., fast decay of the singular values as depicted in (e), which is exploited in the low rank and partially separable models.
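The two rearrangements described in this caption can be sketched on a toy series. The data below are hypothetical (four 2 × 3 frames in which a single pixel changes once), and the function names are illustrative only. The singular value decomposition itself is omitted because it would normally be delegated to a numerical library.

```python
# Hypothetical toy series: 4 frames of a 2x3 image; one pixel changes at frame 2.
frames = [
    [[1, 1, 0], [2, 2, 0]],
    [[1, 1, 0], [2, 2, 0]],
    [[1, 1, 5], [2, 2, 0]],
    [[1, 1, 5], [2, 2, 0]],
]

def temporal_finite_difference(series):
    """Pixel-wise difference between consecutive frames along time."""
    return [
        [[b - a for a, b in zip(row0, row1)] for row0, row1 in zip(f0, f1)]
        for f0, f1 in zip(series, series[1:])
    ]

def casorati(series):
    """Rows index pixels, columns index time: each frame is vectorized into a column."""
    columns = [[p for row in frame for p in row] for frame in series]
    return [list(r) for r in zip(*columns)]

diff = temporal_finite_difference(frames)
# Count nonzero entries: a sparse result means few pixels change between frames.
nonzero = sum(1 for f in diff for row in f for p in row if p != 0)
C = casorati(frames)  # 6 pixels x 4 frames
```

Here only 1 of the 18 finite-difference entries is nonzero, and each row of `C` is one pixel's time profile. Many rows being identical or nearly so is the redundancy that low-rank models exploit.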
FIGURE 9
Subsampling in combination with efficient constrained reconstruction allows for improved time resolution in comparison to Nyquist sampling; a spiral-based acquisition as depicted in Protocol 2 is used. Images reconstructed using (a) Nyquist sampling and online gridding reconstruction result in time resolution of 78 msec/frame, while (b) sub-Nyquist sampling and online gridding reconstruction allows for significantly improved time resolution, at the expense of aliasing artifacts. (c) Offline constrained reconstruction addresses this tradeoff by resolving the aliasing at the native time resolution of 12 msec/frame. Note the apparent advantage of the increase in time resolution of (c) vs. (a) in terms of crispness along the time axis. The task was to repeatedly count numbers “one-two-three-four-five” at a rapid pace (the subject spoke ~4 times faster than his normal speech pace).
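The two time resolutions quoted here imply the acceleration factor directly, as the short calculation below shows. The TR of 6 msec is an assumption introduced only to express the frame times as interleave counts; it is not stated in the caption.

```python
NYQUIST_MS_PER_FRAME = 78.0   # from the caption, panel (a)
NATIVE_MS_PER_FRAME = 12.0    # from the caption, constrained reconstruction

# Acceleration is the ratio of the two frame times.
acceleration = NYQUIST_MS_PER_FRAME / NATIVE_MS_PER_FRAME

# Assumption: a hypothetical TR of 6 msec per spiral interleave.
TR_MS = 6.0
interleaves_nyquist = NYQUIST_MS_PER_FRAME / TR_MS    # interleaves for full sampling
interleaves_per_frame = NATIVE_MS_PER_FRAME / TR_MS   # interleaves per accelerated frame

frames_per_second = 1000.0 / NATIVE_MS_PER_FRAME
```

The ratio works out to 6.5, matching the ~6.5-fold acceleration quoted in the Figure 8 caption, and 12 msec/frame corresponds to roughly 83 frames per second.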
FIGURE 10
Example of acquisition with Protocol 3: The raw data were acquired with a radial gradient echo sequence with golden angle rotation. Images were reconstructed using a conjugate gradient-SENSE method that applies regularization using a spatial total variation (TV) operator. Twenty-five subsequent echoes were used to calculate low resolution sensitivity maps for each coil and a complete image with a spatial resolution of 1.8 × 1.8 × 10 mm³, leading to a native temporal resolution of 55 msec that was further accelerated to 40 msec by applying a sliding window. (Figure courtesy: M. Burdumy, University Medical Center Freiburg, Germany.)
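A minimal sketch of golden-angle spoke ordering and sliding-window framing follows. It assumes each echo corresponds to one radial spoke, so that 25 spokes in 55 msec implies a TR of 2.2 msec. The window step of 18 spokes is a hypothetical choice that gives an update interval close to the quoted 40 msec; the caption does not state the step actually used.

```python
import math

# Golden angle for radial MRI: 180 degrees divided by the golden ratio (~111.246).
GOLDEN_ANGLE_DEG = 180.0 * (math.sqrt(5.0) - 1.0) / 2.0

def spoke_angles(n_spokes):
    """Angle of each successive spoke, wrapped to [0, 180) since spokes are symmetric."""
    return [(i * GOLDEN_ANGLE_DEG) % 180.0 for i in range(n_spokes)]

SPOKES_PER_FRAME = 25                      # from the caption
NATIVE_MS = 55.0                           # from the caption
TR_MS = NATIVE_MS / SPOKES_PER_FRAME       # implied TR under the one-spoke-per-echo assumption

def sliding_window_starts(n_spokes, window, step):
    """Index of the first spoke of each frame when the window advances by `step`."""
    return list(range(0, n_spokes - window + 1, step))

STEP = 18  # assumption: hypothetical step giving roughly a 40 msec update interval
starts = sliding_window_starts(100, SPOKES_PER_FRAME, STEP)
update_ms = STEP * TR_MS
```

Because consecutive golden-angle spokes cover k-space nearly uniformly for any window length, overlapping windows like these can be formed retrospectively. Each frame still spans 55 msec of data, so the sliding window raises the frame update rate rather than the true temporal footprint.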
FIGURE 11
Strip plot through a single slice at the tip of the tongue (indicated by the dotted line) during an acquisition with Protocol 4, using spatial-spectral regularization and the PS model for reconstruction. The acquisition captured eight simultaneous sagittal slices at 102 frames per second while the subject uttered a simple speech sample.

