during prolonged noise exposure. Patients with such selective fiber loss would still be able to hear quiet sounds quite well, because their HSR fibers are intact, but they would find it very difficult to resolve sounds at high sound levels, where HSR fibers are saturating and the LSR fibers that should encode spectral contrast are missing. It has long been recognized that our ability to hear speech in noise tends to decline with age, even in those elderly listeners who are lucky enough to retain normal auditory sensitivity (Stuart & Phillips, 1996), and it has been suggested that cumulative noise‐induced damage to LSR fibers, such as that described by Kujawa and Liberman in their mouse model, may be one culprit. Such hidden hearing loss, which is not detectable with standard audiometric hearing tests that measure sensitivity to probe tones in quiet, can be a significant problem, for example by taking all the fun out of social occasions such as lively parties and get‐togethers, and thereby contributing to social isolation. However, some recent studies have looked for, but failed to find, a clear link between greater noise exposure and poorer reception of speech in noise (Grinn et al., 2017; Grose, Buss, & Hall, 2017), which suggests that the decline in our ability to understand speech in noise as we age may have more to do with impaired representations of speech in higher cortical centers than with impaired auditory nerve representations.
Of course, when you listen to speech, you don’t really want to have to ask yourself whether, given the current ambient sound levels, you should be listening to your HSR or your LSR auditory nerve fibers in order to get the best representation of speech formants. One of the jobs of the auditory brainstem and midbrain circuitry is to combine information across these nerve fiber populations, so that representations at midbrain and cortical stations automatically adapt to changes both in mean sound level and in sound‐level contrast or variability, and features like formants are encoded efficiently whatever the current acoustic environment happens to be (Dean, Harper, & McAlpine, 2005; Rabinowitz et al., 2013; Willmore et al., 2016).
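To give a flavor of what such adaptive coding might amount to, the short Python sketch below shows one very simple way a simulated neuron could re‐center and re‐scale its limited, saturating output around the running mean and variability of recent sound levels. It is purely illustrative: it is not the model used in the studies just cited, and the function name, time constant, and other parameter values are invented for this example.

```python
import numpy as np

def adapt_gain(level_db, fs=1000.0, tau=0.1):
    """Toy firing-rate adaptation to stimulus statistics (illustrative only).

    level_db : recent sound levels in dB, sampled at fs Hz.
    The simulated neuron keeps running estimates of the mean and spread
    (contrast) of recent levels and squeezes its limited output range
    around whatever levels currently occur most often.
    """
    level_db = np.asarray(level_db, dtype=float)
    alpha = 1.0 - np.exp(-1.0 / (tau * fs))       # smoothing constant for the running estimates
    mean_est, var_est = level_db[0], 1.0
    rate = np.empty_like(level_db)
    for i, x in enumerate(level_db):
        mean_est += alpha * (x - mean_est)                   # adapt to mean sound level
        var_est += alpha * ((x - mean_est) ** 2 - var_est)   # adapt to level contrast / variability
        z = (x - mean_est) / np.sqrt(var_est + 1e-6)         # normalized level
        rate[i] = 1.0 / (1.0 + np.exp(-z))                   # saturating output between 0 and 1
    return rate
```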
As we saw earlier, the tonotopic representation of speech‐sound spectra in the auditory nerve provides much information about speech formants, but not a great deal about the harmonics that would reveal voicing or voice pitch. We probably owe much of our ability to nevertheless hear voicing and pitch easily and with high accuracy to the fact that, in addition to the small number of resolved harmonics, the auditory nerve delivers a great deal of so‐called temporal fine structure information to the brain. To appreciate what is meant by that, consider Figure 3.4, which shows the waveform (top), a spectrogram (middle), and an auditory nerve neurogram display (bottom) for a recording of the spoken word ‘head.’ The neurogram was produced by computing the firing rates of a bank of LSR auditory nerve fibers in response to the sound as a function of time, using the model by Zilany, Bruce, and Carney (2014). The waveform reveals the characteristic energy arc remarked upon by Greenberg (2006) for spoken syllables, with a relatively loud vowel flanked by much quieter consonants. The voicing in the vowel is manifest in the large sound‐pressure amplitude peaks, which arise from the glottal pulse train at regular intervals of approximately 7 ms, that is, at a rate of approximately 140 Hz. This voice pitch is also reflected in the harmonic stack in the spectrogram, with harmonics at multiples of ~140 Hz, but this harmonic stack is not apparent in the neurogram. Instead we see that the nerve fibers rapidly modulate their firing rates to produce a temporal pattern of bands at time intervals that either directly reflect the 7 ms period of the glottal pulse train (for nerve fibers with CFs below 0.2 kHz or above 1 kHz) or correspond to integer fractions (harmonics) of that period. In this manner auditory nerve fibers convey important cues for acoustic features such as periodicity pitch, by phase locking their discharges to salient features of the temporal fine structure of speech sounds with sub‐millisecond accuracy.
Figure 3.4 Waveform (top), spectrogram (middle), and simulated LSR auditory nerve‐fiber neurogram (bottom) of the spoken word ‘head’ [hɛːd].
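Readers who like to experiment may find it instructive to see how even a very crude simulation reproduces the banding just described. The Python sketch below is emphatically not the Zilany, Bruce, and Carney (2014) model used for Figure 3.4; it simply band‐pass filters the sound around a handful of assumed characteristic frequencies, half‐wave rectifies, and smooths, which is enough for phase locking to a ~7 ms glottal pulse train to show up as vertical bands in the resulting ‘neurogram.’ The channel frequencies and filter settings are arbitrary choices, and a sampling rate of about 16 kHz or more is assumed.

```python
import numpy as np
from scipy import signal

def toy_neurogram(x, fs, cfs=(200, 500, 1000, 2000, 4000), rate_cutoff=800.0):
    """Grossly simplified neurogram; NOT the Zilany, Bruce, and Carney (2014) model.

    Each channel: band-pass filter around a characteristic frequency (CF),
    half-wave rectify (a crude stand-in for hair-cell transduction), then
    low-pass filter to approximate an instantaneous firing rate. For a voiced
    vowel, phase locking to the ~7 ms glottal pulse train shows up as
    vertical bands recurring at ~140 Hz across channels.
    """
    x = np.asarray(x, dtype=float)
    b_lp, a_lp = signal.butter(2, rate_cutoff / (fs / 2), btype='lowpass')
    rows = []
    for cf in cfs:
        lo, hi = cf / 2 ** 0.25, cf * 2 ** 0.25                        # half-octave band around CF
        b, a = signal.butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='bandpass')
        bm = signal.lfilter(b, a, x)                                   # 'basilar membrane' motion
        rate = signal.lfilter(b_lp, a_lp, np.maximum(bm, 0.0))         # rectify and smooth
        rows.append(rate)
    return np.vstack(rows)                                             # shape: (n_channels, n_samples)
```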
As an aside, note that it is quite common for severe hearing impairment to be caused by an extensive loss of auditory hair cells in the cochlea, which can leave the auditory nerve fibers largely intact. In such patients it is now often possible to restore some hearing through cochlear implants, which use electrode arrays implanted along the tonotopic array to deliver direct electrical stimulation to the auditory nerve fibers. The electrical stimulation patterns delivered by the 20‐odd electrode contacts provided by these devices are quite crude compared to the activity patterns created when the delicate dance of the basilar membrane is captured by some 3,000 phenomenally sensitive auditory hair cells, but because coarsely resolving only a modest number of formant peaks is normally sufficient to allow speech sounds to be discriminated, the large majority of deaf cochlear implant patients do gain the ability to hold pretty normal spoken conversations – as long as there is little background noise. Current cochlear implant processors are essentially incapable of conveying, via the auditory nerve, any of the temporal fine structure information we have just described, and consequently cochlear implant users miss out on things like periodicity pitch cues, which may help them separate out voices in a cluttered auditory scene. A lack of temporal fine structure can also affect the perception of dialect and affect in speech, as well as of melody, harmony, and timbre in music.
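To make concrete why the fine structure goes missing, the following sketch of a generic envelope‐based (‘CIS‐style’) processing strategy may help. It is a simplified illustration rather than any manufacturer’s actual algorithm: the signal is split into about 20 bands, and only each band’s slowly varying envelope, the kind of signal that would modulate the current on the corresponding electrode, is kept, so anything faster, including the fine structure that carries periodicity pitch, is discarded. The band edges, filter orders, and 300 Hz envelope cutoff are assumptions chosen for illustration.

```python
import numpy as np
from scipy import signal

def ci_envelopes(x, fs, n_channels=20, env_cutoff=300.0, f_lo=250.0, f_hi=7000.0):
    """Sketch of a generic envelope-based (CIS-style) implant strategy (illustrative only).

    The microphone signal is split into n_channels logarithmically spaced bands,
    and only the slowly varying envelope of each band (below env_cutoff Hz) would
    be used to modulate the current pulses on the corresponding electrode.
    Everything faster (the temporal fine structure) is simply thrown away.
    """
    x = np.asarray(x, dtype=float)
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    b_env, a_env = signal.butter(2, env_cutoff / (fs / 2), btype='lowpass')
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = signal.butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='bandpass')
        band = signal.lfilter(b, a, x)
        envelopes.append(signal.lfilter(b_env, a_env, np.abs(band)))   # keep the slow envelope only
    return np.vstack(envelopes)    # one row per (hypothetical) electrode channel
```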
Subcortical pathways
As we saw in Figure 3.1, the neural activity patterns just described are passed on first to the cochlear nuclei, and from there through the superior olivary nuclei to the midbrain, thalamus, and primary auditory cortex. As mentioned, each of these stations of the lemniscal auditory pathway has a tonotopic structure, so everything we learned in the previous section about tonotopic arrays of neurons representing speech formant patterns, neurogram style, still applies at each of these stations. But that is not to say that the neural representation of speech sounds does not undergo some transformations along these pathways. For example, the cochlear nuclei contain a variety of different neural cell types that receive different types of converging inputs from auditory nerve fibers, which may make them more or less sensitive to certain acoustic cues. So‐called octopus cells, for example, collect inputs from a number of fibers across an extent of the tonotopic array, which makes them less sharply frequency tuned but more sensitive to the temporal fine structure of sounds such as glottal pulse trains (Golding & Oertel, 2012). So‐called bushy cells in the cochlear nucleus are also very keen on maintaining the temporal fine structure encoded in the timing of auditory nerve fiber inputs with very high precision, and on passing this information on undiminished to the superior olivary nuclei (Joris, Smith, & Yin, 1998). The nuclei of the superior olive receive converging (and, of course, tonotopically organized) inputs from both ears, which allows them to compute binaural cues to the direction that sounds may have come from (Schnupp & Carr, 2009). Thus, firing‐rate distributions across neurons in the superior olive, and in subsequent processing stations, may provide information not just about the formants or voicing of a speech sound, but also about whether the speech came from the left, from the right, or from straight ahead. This adds further dimensions to the neural representation of speech sounds in the brainstem, but much of what we have seen still applies: formants are represented by peaks of activity across the tonotopic array, and the temporal fine structure of the sound is represented by the temporal fine structure of neural firing patterns. However, while the tonotopic representation of speech formants remains preserved throughout the subcortical pathways up to and including the primary auditory cortex, temporal fine structure at fast rates of up to several hundred hertz is not preserved much beyond the superior olive. Maintaining the sub‐millisecond precision of firing patterns across a chain of chemical synapses and neural cell membranes that typically have temporal jitter and time constants in the millisecond range is not easy. To be up to the job, neurons in the cochlear nucleus and olivary nuclei have specialized synapses and ion channels that more ordinary neurons in the rest of the nervous system lack.
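The binaural computation just attributed to the superior olive can be caricatured in a few lines of code: estimating the interaural time difference amounts to asking at what physiologically plausible lag the two ears’ signals line up best. The sketch below uses a plain cross‐correlation as a stand‐in for neural coincidence detection; it is not a model of olivary circuitry, the function name is invented, and the 0.7 ms lag limit is simply an assumed, roughly head‐sized maximum delay.

```python
import numpy as np
from scipy import signal

def estimate_itd(left, right, fs, max_itd=0.7e-3):
    """Toy interaural time difference (ITD) estimate by cross-correlation.

    A crude stand-in for the coincidence detection attributed to the medial
    superior olive: find the lag, within head-sized limits, at which the
    two ears' signals are most similar.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    corr = signal.correlate(left, right, mode='full')
    lags = signal.correlation_lags(len(left), len(right), mode='full')
    plausible = np.abs(lags) <= max_itd * fs          # only delays a head could produce (~0.7 ms)
    best_lag = lags[plausible][np.argmax(corr[plausible])]
    # Sign follows scipy's correlation convention: a positive lag means the
    # left-ear signal is the delayed copy, i.e. the source was off to the right.
    return best_lag / fs
```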
It