eNeuro. 2023 May 10;10(5):ENEURO.0447-22.2023.
doi: 10.1523/ENEURO.0447-22.2023. Print 2023 May.

Effect of Reverberation on Neural Responses to Natural Speech in Rabbit Auditory Midbrain: No Evidence for a Neural Dereverberation Mechanism

Oded Barzelay et al. eNeuro. 2023.

Abstract

Reverberation is ubiquitous in everyday acoustic environments. It degrades both binaural cues and the envelope modulations of sounds and thus can impair speech perception. Still, both humans and animals can accurately perceive reverberant stimuli in most everyday settings. Previous neurophysiological and perceptual studies have suggested the existence of neural mechanisms that partially compensate for the effects of reverberation. However, these studies were limited by their use of either highly simplified stimuli or rudimentary reverberation simulations. To further characterize how reverberant stimuli are processed by the auditory system, we recorded single-unit (SU) and multiunit (MU) activity from the inferior colliculus (IC) of unanesthetized rabbits in response to natural speech utterances presented with no reverberation ("dry") and in various degrees of simulated reverberation (direct-to-reverberant energy ratios (DRRs) ranging from 9.4 to -8.2 dB). Linear stimulus reconstruction techniques (Mesgarani et al., 2009) were used to quantify the amount of speech information available in the responses of neural ensembles. We found that high-quality spectrogram reconstructions could be obtained for dry speech and in moderate reverberation from ensembles of 25 units. However, spectrogram reconstruction quality deteriorated in severe reverberation for both MUs and SUs such that the neural degradation paralleled the degradation in the stimulus spectrogram. Furthermore, spectrograms reconstructed from responses to reverberant stimuli resembled spectrograms of reverberant speech better than spectrograms of dry speech. Overall, the results provide no evidence for a dereverberation mechanism in neural responses from the rabbit IC when studied with linear reconstruction techniques.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1.
Effect of reverberation on a speech stimulus. A, Dimensions of the virtual room (13 × 11 × 3 m) used to simulate binaural room impulse responses (BRIRs) by the room-image method. The source speaker was positioned at either 1.5 m (blue x) or 3.0 m (red x) in front of the receivers (0° azimuth). For each of these two source-to-listener distances, we also varied the wall absorption coefficients: 20% for a highly reverberant room, 80% for a mildly reverberant room, and 100% for no reverberation (also known as the dry condition). Overall, we simulated five BRIRs: one dry condition and four reverberant conditions with direct-to-reverberant energy ratios (DRRs) ranging from +9.4 to −8.2 dB. B, The first 200 ms of an example BRIR for the most reverberant case (DRR = −8.2 dB, left ear). The BRIR is composed of the direct sound (yellow), individual early reflections (red), and overlapping late reflections (blue). C, Waveform (blue) and broadband envelope (red) of the utterance “Laugh, dance, and sing if fortune smiles on you” pronounced by a female speaker for dry and highly reverberant conditions. D, Spectrograms of the dry and reverberant utterances in C. Each row in the spectrogram represents the bandpass Hilbert envelope of the stimulus with a center frequency on a log scale given on the y-axis. Adding reverberation smears the stimulus envelope, prolongs onsets and offsets, and fills the silent intervals between sound segments. Speech was filtered through a logarithmically spaced gammatone filterbank that simulates the response of the auditory nerve (Patterson–Holdsworth ERB Filter Bank). The spectrograms contain 30 frequency channels with center frequencies ranging from 250 Hz to 8 kHz and a temporal sampling interval of 5 ms.
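As a rough illustration of the DRR values quoted in this caption, the sketch below computes a direct-to-reverberant energy ratio from a single impulse response. It assumes the direct sound is everything up to a short window after the largest peak; the 2.5-ms window, the function name `drr_db`, and the toy impulse response are illustrative assumptions, not details taken from the paper.

```python
import math

def drr_db(ir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant energy ratio (dB) of one impulse response.

    ir: sequence of samples; fs: sampling rate in Hz.
    direct_window_ms is a hypothetical choice: samples up to this long
    after the largest peak are counted as direct sound.
    """
    peak = max(range(len(ir)), key=lambda i: abs(ir[i]))
    split = peak + int(direct_window_ms * 1e-3 * fs) + 1
    e_direct = sum(x * x for x in ir[:split])
    e_reverb = sum(x * x for x in ir[split:])
    if e_reverb == 0.0:
        return math.inf  # anechoic ("dry") response: no reverberant energy
    return 10.0 * math.log10(e_direct / e_reverb)

# Toy impulse response (made-up values): a unit direct sound
# followed by two weaker reflections.
fs = 1000
ir = [0.0] * 100
ir[5] = 1.0   # direct sound
ir[30] = 0.5  # reflection
ir[60] = 0.5  # reflection
print(round(drr_db(ir, fs), 2))
```

A positive DRR means the direct sound carries more energy than the reflections; the most reverberant condition in the caption (−8.2 dB) is the opposite case.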
Figure 2.
Linear spectrogram reconstruction. To quantify the amount of stimulus information available in neural responses, we used the optimal linear reconstruction technique applied to spectrograms. In Step 1, stimulus spectrograms in the dry condition and the corresponding measured responses of an ensemble of units (“dry responses”) are used to derive the optimal reconstruction filter (“dry filter”). The dry filter is optimal in that it minimizes the mean-square error between the stimulus spectrogram and a reconstructed spectrogram for the training data. In Step 2, a different set of neural responses is used with the dry filters to reconstruct the spectrogram for each reverberant condition. We use cross-validation between the two steps, such that the dry filter is derived from a subset of the data while the reconstruction accuracy is determined for the remaining subset not used for training. For each unit, the dry reconstruction filter is a two-dimensional matrix consisting of weights along frequency (y-axis) and lag (x-axis). We used noncausal reconstruction filters that can have nonzero weights for both positive and negative lags. The 30 frequency weights range from 250 Hz to 8 kHz on a log frequency scale, as for the stimulus spectrograms. Temporal weights range from −30 to +30 ms in 5-ms steps.
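The two steps described in this caption can be sketched with NumPy as a lagged least-squares fit. This is a minimal sketch under stated assumptions, not the authors' implementation: the small ridge term, the function names (`lagged_design`, `fit_filter`, `reconstruct`), the lag sign convention, and the synthetic "responses" are all illustrative. The lags span −6 to +6 bins, which corresponds to −30 to +30 ms at the 5-ms sampling interval.

```python
import numpy as np

def lagged_design(resp, lags):
    """Stack time-shifted copies of the responses.

    resp: (n_time, n_units) binned responses.
    Returns (n_time, n_units * len(lags)); block k holds resp[t + lags[k]],
    zero-padded where the shift runs past the edges.
    """
    cols = []
    for lag in lags:
        shifted = np.zeros_like(resp)
        if lag > 0:
            shifted[:-lag] = resp[lag:]
        elif lag < 0:
            shifted[-lag:] = resp[:lag]
        else:
            shifted[:] = resp
        cols.append(shifted)
    return np.hstack(cols)

def fit_filter(resp, spec, lags, ridge=1e-3):
    """Step 1: weights minimizing squared error on the training data.

    The ridge term is an assumption added here for numerical stability.
    """
    X = lagged_design(resp, lags)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ spec)

def reconstruct(resp, weights, lags):
    """Step 2: apply previously fitted weights to new responses."""
    return lagged_design(resp, lags) @ weights

# Synthetic demo (made-up data): 8 toy units, 4 frequency channels.
rng = np.random.default_rng(0)
n_time, n_freq, n_units = 400, 4, 8
spec = rng.standard_normal((n_time, n_freq))
mixing = rng.standard_normal((n_freq, n_units))
# Each toy unit responds to a mix of channels 2 bins (10 ms) later.
resp = np.roll(spec, 2, axis=0) @ mixing
resp = resp + 0.1 * rng.standard_normal((n_time, n_units))

lags = range(-6, 7)                       # -30 ms ... +30 ms in 5-ms bins
train, test = slice(0, 300), slice(300, 400)
w = fit_filter(resp[train], spec[train], lags)
rec = reconstruct(resp[test], w, lags)    # held-out data only
cc = np.corrcoef(rec.ravel(), spec[test].ravel())[0, 1]
print(w.shape, cc > 0.8)
```

Fitting on one slice and scoring on another mirrors the cross-validation described in the caption; in the paper's setting the held-out responses would come from reverberant conditions while the weights come from dry training data.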
Figure 3.
Comparing pure-tone characteristic frequency (CF) with best frequency of correlation coefficient (BFCC) for speech stimuli. A, Frequency response area (FRA) of four neurons with CFs of 3805, 2690, 1345, and 9050 Hz. B, Scatter plot of CF against BFCC, for the 65 units (SU or MU) in which an FRA was measured. Each dot corresponds to one measurement (SU or MU), and the symbol size is proportional to the CC between the best envelope and the measured response to speech. Across the 56 recording sites with CF < 8 kHz, there is a weak correlation between CF and BFCC [SU: R² = 0.65, p < 10⁻⁴, root-mean-square error (RMSE) = 1.41; MU: R² = 0.63, p < 10⁻⁴, RMSE = 1.27].
Figure 4.
Response of a single unit (SU) and a multiunit (MU) from the same IC recording site to a speech utterance presented in various reverberant conditions. The utterance “Growing well-kept gardens is very time consuming” was pronounced by a female speaker. A, Poststimulus time histogram (PSTH; 5-ms bin width, blue bars) of the SU and median MU response (continuous blue line) to the dry speech. The black line shows the best fitting stimulus envelope (the output of the gammatone filter centered at BFCC) for the dry condition. B, Same as in A for each reverberant condition identified by the DRR on the left. All panels share the same time scale, but amplitudes were scaled to facilitate comparison. C, Pearson correlation coefficient (CC) between the neural response and the envelope of the dry speech at the output of the gammatone filter centered at BFCC as a function of DRR for both the SU and the MU. D, Response modulation depth of the SU (RMDSU), the MU (RMDMU), and stimulus modulation depth (SMD) as a function of DRR.
Figure 5.
Effect of reverberation on responses of SUs (A) and MUs (B) to speech across the neural population. For each SU or MU (colored circle), we calculated the Pearson correlation coefficient (CC) between the neural response and the envelope of the dry speech at the output of the gammatone filter centered at BFCC. The white circles show the median CC across the population for each DRR. For both SUs and MUs, the median CCs decrease monotonically with increasing amount of reverberation (decreasing DRR), although the CCs for SUs are much lower and show greater variability than CCs for MUs. The black lines and squares show the stimulus CCs computed between the dry and reverberant stimulus spectrograms (i.e., not including neural processing).
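The CC used in this figure is an ordinary Pearson correlation between a binned response and a channel envelope. The sketch below also picks the channel whose envelope correlates best with the response, which is one plausible reading of how BFCC is chosen; that reading, the function names, and the toy numbers are assumptions made here for illustration.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_channel(response, envelopes):
    """Index and CC of the envelope that correlates best with the response."""
    ccs = [pearson_cc(response, env) for env in envelopes]
    idx = max(range(len(ccs)), key=lambda i: ccs[i])
    return idx, ccs[idx]

# Toy data (made up): three channel envelopes and one response that is
# a scaled, offset copy of the second envelope.
envelopes = [
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
]
response = [1.0, 1.0, 3.0, 3.0, 1.0, 1.0]
idx, cc = best_channel(response, envelopes)
print(idx, round(cc, 3))
```

Because the CC is invariant to scaling and offset, it measures how well the response follows the shape of the envelope rather than its absolute level.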
Figure 6.
Reverberation affects the temporal coding of amplitude modulation in IC single units. A, Response modulation depth (RMD; colored circles) and stimulus modulation depth (SMD; black rectangles) for the sample of 103 single-unit responses. For the SMD, the black rectangles show the 25th and 75th percentiles, and the red horizontal bars inside each of the black rectangles are the median SMDs across frequency channels for each DRR. Despite a slight trend for the median RMD to decrease with increasing reverberation, the effect was not statistically significant (Kruskal–Wallis test: p = 0.398, χ² = 4.05, df = 4) because of the large variability in the data. However, the median SMDs clearly decreased with increasing amount of reverberation and approached the median RMDs for negative DRRs. B, The neural modulation gain (MG), in dB, is the ratio of the RMD to the SMD for each unit. The median MG tended to increase with increasing reverberation (Kruskal–Wallis test: p < 10⁻⁴, χ² = 80.1, df = 4); this observation is consistent with earlier findings using sinusoidally amplitude-modulated (SAM) noise stimuli (Kuwada et al., 2014; Slama and Delgutte, 2015).
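The modulation gain in panel B can be written as a one-line conversion of the RMD/SMD ratio to decibels. The sketch assumes the 20·log10 amplitude convention, which the caption does not state explicitly, and the numbers are made up to show why the gain rises when the SMD falls while the RMD stays roughly constant.

```python
import math

def modulation_gain_db(rmd, smd):
    """Modulation gain in dB, assuming MG = 20 * log10(RMD / SMD)."""
    if rmd <= 0 or smd <= 0:
        raise ValueError("modulation depths must be positive")
    return 20.0 * math.log10(rmd / smd)

# Illustrative values only: the response depth is held fixed while the
# stimulus depth shrinks, as it would with increasing reverberation.
rmd = 0.5
for smd in (0.5, 0.35, 0.25):
    print(smd, round(modulation_gain_db(rmd, smd), 2))
```

Under this convention a gain of 0 dB means the response is modulated exactly as deeply as the stimulus, and halving the SMD at fixed RMD adds about 6 dB.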
Figure 7.
Linear spectrogram reconstructions for dry and reverberant speech. A, Stimulus spectrograms of an utterance in dry and four reverberant conditions. B, Corresponding linear spectrogram reconstructions based on the responses of 241 multiunits. Increasing reverberation degrades the reconstruction quality, as measured by the Pearson cross-correlation (CC) between the reconstruction (Ŝ_DRR) and the dry stimulus spectrogram (S_dry). However, reconstruction quality remains high (CC > 0.89) so long as the DRR is > 0 dB. Severe degradation only occurs for negative DRRs.
Figure 8.
Spectrogram reconstruction quality from ensemble neural responses degrades with increasing amount of reverberation and shows no evidence for a dereverberation process for both SUs (A) and MUs (B). For each DRR, boxplots show the distributions of CC scores for reconstruction quality across the 12 TIMIT utterances used as stimuli. Two CC scores are shown for each DRR. The S_dry-vs-Ŝ_DRR score (blue bars) is the CC between the dry spectrogram S_dry and the reconstructed spectrogram Ŝ_DRR. The S_DRR-vs-Ŝ_DRR score (red bars) is the CC between a reverberant stimulus spectrogram (S_DRR) and the reconstructed spectrogram Ŝ_DRR for the same DRR. Both scores were computed with reconstruction models that were trained with dry stimuli (dry-filter models). Black squares show the stimulus-only CCs between dry and reverberant spectrograms. Each boxplot shows the median, the interquartile range (IQR), and the nonoutlier minimum and maximum. Outliers (circles) are defined as having values >1.5 IQR above the upper quartile.
Figure 9.
Three factors contribute to the observed degradation in reconstruction quality with increasing reverberation for both SUs (A) and MUs (B). Three reconstruction methods differing in the contribution of each factor are compared as a function of the amount of reverberation (DRR). The S_DRR-vs-Ŝ_DRR (same-DRR filters) correlation coefficients (yellow bars) represent the degradation due to envelope tracking errors, that is, the inability of the linear reconstruction model to perfectly track the stimulus envelope when the model is trained and tested with stimuli with the same degree of reverberation. The S_DRR-vs-Ŝ_DRR (dry filters) scores (red bars) include the additional degradation due to model generalization failure, which refers to the model’s inability to capture new reconstructions when trained with dry stimuli and tested with reverberant stimuli. Finally, the S_dry-vs-Ŝ_DRR (dry filters) CC scores (blue bars) include the additional effects of distortion compensation failure, which is the inability to compensate for the distortion of the original speech envelope introduced by reverberation. Black squares show the stimulus-only CCs; p-values for post hoc paired comparisons between reconstruction methods based on a two-way repeated-measures ANOVA are shown (see text).
Figure 10.
Quality of spectrogram reconstruction from ensemble responses of both SUs (A) and MUs (B) improves with increasing ensemble size and is better for MUs than for SUs for small ensemble sizes and modest reverberation. For each amount of reverberation, reconstruction quality was quantified by the Pearson correlation coefficient (CC) between the dry stimulus spectrogram and the corresponding spectrogram reconstruction. This was done for neural ensembles of various sizes. Twenty-five MU measurements sufficed to reach an asymptote in reconstruction quality for all DRRs; that is, adding more units to the ensemble did not improve CCs substantially. With ensembles of ≥50 units, reconstruction quality was high (CC > 0.8) for dry and mild reverberation conditions (DRR > 0 dB) but deteriorated markedly in severe reverberation (DRR < 0 dB). Reconstruction quality was higher when based on MUs than when based on SUs, especially for small ensemble sizes and low reverberation. Black squares show stimulus CCs between the dry and reverberant stimulus spectrograms (i.e., not including neural responses). This benchmark was reached in severe reverberation for reconstructions based on MUs.
Figure 11.
Temporal variations in spectrogram reconstruction quality over the course of an utterance. A, Dry stimulus spectrogram of the utterance “Laugh, dance and sing if fortune smiles on you” pronounced by a female speaker. Selected phones are labeled below the spectrogram (not all phonemes are shown to avoid clutter). Purple horizontal lines show voiced segments identified using the probabilistic YIN (pYIN) algorithm. B, Time-dependent correlations (CCt) were calculated between pairs of spectrograms using 5-ms time steps and over the whole frequency range (30 frequency bands). When assessed against the dry speech spectrogram, the quality of reconstruction derived from responses to reverberant speech fluctuates over time (blue curve). These fluctuations in reconstruction quality closely parallel the short-term cross-correlation between the dry and reverberant speech spectrograms (black curve), suggesting they are largely stimulus-driven. Fluctuations in reconstruction quality are less pronounced when assessed against the reverberant speech spectrogram (red curve).
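A time-dependent correlation of this kind can be sketched as one Pearson correlation per time frame, taken across the frequency channels of two spectrograms. The caption does not say whether a sliding window of several frames was used, so the single-frame version below is an assumption, as are the function names and the tiny four-channel spectrograms.

```python
import math

def frame_cc(a, b):
    """Pearson CC across frequency channels for one time frame.

    Returns NaN when either frame has no variance across channels.
    """
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0.0 or sbb == 0.0:
        return math.nan
    return sab / math.sqrt(saa * sbb)

def cc_t(spec_a, spec_b):
    """Frame-by-frame CC; spec[t] lists the channel values at frame t."""
    return [frame_cc(a, b) for a, b in zip(spec_a, spec_b)]

# Made-up spectrograms: 3 frames (5 ms each) by 4 channels.
dry = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]]
rev = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
track = cc_t(dry, rev)
print(track)
```

Comparing such a track for dry-versus-reconstructed spectrograms with the track for dry-versus-reverberant stimulus spectrograms is the kind of comparison the blue and black curves make.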
Figure 12.
A, B, Scatter plots of short-term cross-correlation (CCt) between pairs of spectrograms (time step is 5 ms as in Fig. 11B). A, Reconstruction scores CCt between dry and reconstructed spectrograms (y-axis) are comparable (Pearson correlation test, r_t = 0.836, p < 10⁻⁴, orthogonal regression slope: 1.08) to CCt between the dry and reverberant speech spectrograms (x-axis). B, Reconstruction scores of CCt between highly reverberant spectrograms (DRR = −8.2 dB) and the corresponding reconstructed spectrograms show less resemblance (Pearson correlation test, r_t = 0.501, p < 10⁻⁴, orthogonal regression slope: 2.19). C, Reconstruction quality was estimated separately for voiced and unvoiced segments. Histograms show the distribution of Fisher-transformed CCt between the dry stimulus spectrograms and the spectrogram reconstructed in strong reverberation (DRR = −8.2 dB) for voiced (purple bars) and unvoiced (green bars) segments. For comparison, the distributions of the CCt between the dry stimulus spectrogram and the reverberant stimulus spectrogram (DRR = −8.2 dB) are shown for voiced (purple line) and unvoiced (green line) segments. Reconstruction quality was consistently high for voiced speech but varied widely for unvoiced speech. In addition, the reconstruction quality (colored bars) parallels the distributions of correlations between dry and reverberant speech spectrograms (colored lines). D, Histogram of the distributions of stimulus energy for voiced and unvoiced segments computed from the dry spectrogram (silent segments were omitted). The energy distributions clearly overlap between voiced and unvoiced segments, in contrast to the more separated distribution of CCt in C. This suggests that differences in stimulus energy cannot entirely explain the greater reconstruction quality observed for voiced segments compared with unvoiced segments.
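The Fisher transformation mentioned for panel C maps a correlation r to z = atanh(r), which spreads out values near ±1 so that their distribution is easier to compare across groups. The sketch below is generic; the clipping constant `eps`, the function name, and the example CCt values for voiced and unvoiced frames are illustrative assumptions rather than data from the paper.

```python
import math

def fisher_z(r, eps=1e-12):
    """Fisher z-transform of a correlation coefficient.

    Values are clipped just inside (-1, 1) so that r = +/-1 stays finite;
    eps is an arbitrary choice made for this sketch.
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)

# Made-up CCt values for a few voiced and unvoiced frames.
voiced = [0.92, 0.88, 0.95]
unvoiced = [0.10, 0.75, -0.20]
z_voiced = [fisher_z(r) for r in voiced]
z_unvoiced = [fisher_z(r) for r in unvoiced]
print([round(z, 2) for z in z_voiced])
print([round(z, 2) for z in z_unvoiced])
```

The transform is monotonic, so the ordering of frames by correlation is preserved; only the spacing near the extremes changes.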
