addition to variation in stimulus talkers and phonetic contexts, the nature of corrective feedback (CF) has also been shown to impact perceptual learning. While corrective feedback of some sort is a feature of all HVPT, Lee and Lyster (2017) explored what role specific types of CF play in the transfer of perceptual training to production. Their HVPT experiment tested CF in four conditions. One group was told only that their response was incorrect, but not what the correct response should have been. Three other groups were also told when they were wrong, but received additional input as well: hearing an example of the target item again, hearing an example of the non-target item they had selected, or hearing both the target and non-target sounds. Only the groups that heard either the target form or the non-target form as part of CF displayed transfer to production. Those that did not hear either target or non-target items, or those that heard both, did not improve in production. In sum, Lee and Lyster’s study confirms that drawing learners’ attention to errors, through either negative or positive evidence, contributes to learning, just as it does in L1 pronunciation development.
Two other approaches to perceptual training are also worth noting, although neither has a sufficient evidence base to support its widespread adoption. Underlying both seems to be an assumption that accurate perception of one’s own productions (i.e., self-perception) has a facilitative effect in ultimately matching what one perceives with what one produces (Baker & Trofimovich, 2006; Borden et al., 1983). While neither is a purely perception-oriented technique, both include a perceptual component through the use of imitation. Rojczyk (2015) instructed Polish learners of English to imitate English-accented Polish sentences, targeting particular English consonants. This practice may have had the effect of orienting learners’ attention to phonetic information produced with their own voices in their L1. The researcher found that imitation of English-accented Polish led to positive transfer in the learners’ L2 English pronunciation. Other researchers are testing what they call a “Golden Speaker” approach to making perceptual learning easier (Ding et al., 2019). This is based on a belief that there are ideal voices from which particular learners can best develop L2 speech perception. The Golden Speaker web-based application maps a learner’s own voice quality onto the correct pronunciation of target sounds produced by a native speaker. The system then generates training stimuli that sound like the learner’s voice, but without segmental errors. It is assumed that this will make it easier for learners to attend to those parts of the acoustic signal that are distinct from their own voices, because it simulates a vocal tract size and shape that is exactly the same as the learner’s own, but with an articulation model that is native speaker-like. To the extent that these alternative approaches to L2 perception training work, they may be preferable to HVPT, which is more labor-intensive since it requires the accumulation of training stimuli from multiple talkers.
Pedagogical Implications
While HVPT is effective in re-educating selective perception across the lifespan, and encourages progress beyond the WMO (Thomson, 2012a), it is not particularly easy to implement. Until user-friendly HVPT training platforms become more widely accessible, classroom pronunciation instruction will remain the most common mode of delivery. Even when HVPT or other digital applications are used for perception training, they should be seen as a complement to classroom-based instruction, not as a replacement. Ultimately, learners still need opportunities for live interaction to move beyond highly controlled L2 pronunciation skills and towards automatic L2 speech production. In designing classroom instructional tasks, teachers should take note of key findings in the L2 speech perception literature described above. First, they should allow more instructional time for perception-oriented training. Short batteries of perceptual instruction trigger improvement in both perception and production (Thomson, 2018a), usually without any need for explicit production practice. Second, instructors can ensure that multiple talkers are used to provide perceptual input. This could include the teacher in addition to recorded materials. Note that this is different from the common practice of providing multiple English accents and dialects in listening materials (e.g., British, American, Indian). In perception-focused pronunciation instruction, including talkers from different target-language varieties is disadvantageous, since the goal is to learn a particular model as opposed to learning to comprehend other-accented speech. The practice of including multiple accented voices is better suited to the teaching of listening. Third, appropriate types of corrective feedback should be employed (Lee & Lyster, 2016, 2017). Learners should not only be told that they made a mistake in perception but should also be given an opportunity to hear the item again, or hear an example of the non-target item they incorrectly selected. They should not receive feedback that includes replays of both target and non-target items, as this appears to lead to confusion. Fourth, training should incorporate both nonsense syllables/words and real words. The use of nonsense words seems to make phonetic information more salient, while teaching the perception of the same sounds in real words will promote transfer to the real world (Thomson & Derwing, 2016).
Since L2 pronunciation develops slowly and requires substantial amounts of input, it is important to focus perceptual training on those sounds that have the greatest Functional Load (FL). FL refers to the relative contribution of particular sound categories to the communication of meaning in a given language (Derwing & Munro, 2015; Munro & Derwing, 2006). This can be calculated on the basis of the frequency with which sounds contrast with other sounds, and may also incorporate information on the grammatical categories within which minimal pairs are found, since within-category confusion is more problematic for communicating meaning than between-category confusion. For example, an error between “lock” and “rock” would be more confusing to a listener, since both are nouns, than an error between “laughed” and “raft,” where a mispronunciation is more likely to be successfully decoded by listeners, since one is a verb and the other a noun. Some particularly salient sound contrasts, such as those involving the English “th” sounds – /ð/ and /θ/ – have a very low functional load because, despite being high in frequency, they occur mainly in function words. Brown (1991) and Catford (1987) provide detailed FL information for English. In addition to overall FL, the location of particular errors within words has an impact on whether mispronunciations will lead to a breakdown in communication. In general, vowels contribute more to intelligibility than do consonants (Bent et al., 2007). Consonant errors at the beginning of syllables affect communication more than consonant errors at the ends of syllables. Taking these facts into account, pronunciation teachers can prioritize which sounds to teach and in what context. This is important given that learning in one phonetic context or word does not easily transfer to new contexts (Levis, 2018).
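For readers who want to see how such a calculation might look in practice, the following is a minimal sketch in Python. It only counts minimal pairs for a given contrast and notes how many of those pairs fall within the same grammatical category. It is not the procedure used in the sources cited above, and the toy lexicon, its simplified transcriptions and part-of-speech labels, and the function names are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: (spelling, phoneme sequence, part of speech).
# Transcriptions and part-of-speech labels are simplified assumptions.
LEXICON = [
    ("lock", ("l", "ɑ", "k"), "noun"),
    ("rock", ("r", "ɑ", "k"), "noun"),
    ("lip", ("l", "ɪ", "p"), "noun"),
    ("rip", ("r", "ɪ", "p"), "verb"),
    ("laughed", ("l", "æ", "f", "t"), "verb"),
    ("raft", ("r", "æ", "f", "t"), "noun"),
    ("thin", ("θ", "ɪ", "n"), "adjective"),
    ("tin", ("t", "ɪ", "n"), "noun"),
]


def minimal_pairs(lexicon, a, b):
    """Return word pairs whose transcriptions differ only by a vs. b in one position."""
    pairs = []
    for i, (word1, phones1, pos1) in enumerate(lexicon):
        for word2, phones2, pos2 in lexicon[i + 1:]:
            if len(phones1) != len(phones2):
                continue
            # Collect the positions at which the two transcriptions differ.
            diffs = [(x, y) for x, y in zip(phones1, phones2) if x != y]
            if len(diffs) == 1 and set(diffs[0]) == {a, b}:
                pairs.append(((word1, pos1), (word2, pos2)))
    return pairs


def contrast_counts(lexicon, a, b):
    """Count all minimal pairs for a contrast, and those within one grammatical category."""
    pairs = minimal_pairs(lexicon, a, b)
    same_category = sum(1 for (_, pos1), (_, pos2) in pairs if pos1 == pos2)
    return {"total": len(pairs), "same_category": same_category}


if __name__ == "__main__":
    print(contrast_counts(LEXICON, "l", "r"))
    print(contrast_counts(LEXICON, "θ", "t"))
```

In this toy lexicon the /l/–/r/ contrast yields three minimal pairs, only one of which (“lock” and “rock”) falls within a single grammatical category. A realistic estimate would require a full pronouncing dictionary and, ideally, word-frequency information.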
While most of the research on perceptual training has focused on segmentals, the same principles can be applied to determine which suprasegmental features warrant instruction. Derwing et al. (2012) demonstrated that some L2 English prosodic patterns can develop without the need for explicit instruction (e.g., word stress), while other patterns may benefit from instruction (e.g., sentence stress). There are also individual differences related to the L1 background of the listeners.
Lee and Lyster’s (2016) study is a great starting point for envisioning how perception-oriented training can be incorporated into the classroom in an evidence-based manner. They used several activities in their training sessions. One was a Pick-a-Card game, where each half of a minimal pair was written on either side of the same card (e.g., “lip” versus “rip”). Instructors produced target words and individual learners had to show the side of the card that matched the word they perceived. The participants then received immediate feedback on the accuracy of their response. This is different from a traditional minimal-pair activity in two ways. First, learners heard more than one talker produce the same sounds, since three teachers were involved in the training. Second, they received immediate corrective feedback, which is often not the case in a paper-and-pencil minimal-pair task. Lee and Lyster also used Word Bingo and Fill-in-the-Blank activities to teach speech perception by incorporating minimal pairs that were known to cause perceptual confusion for this group of learners. As with the Pick-a-Card game, input was provided by more than one teacher and corrective feedback was given immediately after every response. These same types of activities could easily incorporate nonsense syllables/words in place of real words. This would require learners to know a phonetic alphabet, or to use key words containing the target sounds as labels for them to indicate