more opportunities for teachers to explicitly orient learners’ attention to mismatches and to provide them with more attention-orienting input.
Naturalistic L2 Pronunciation Development
As in L1 learning, L2 pronunciation develops naturalistically through experience with the ambient language. During the very earliest phases of adult L2 learning, aural and/or written input typically predominates as a model for pronunciation. Unlike L1 learners, however, adult L2 learners are already effective communicators in one language and aspire to quickly experience similar success in their L2. This typically means that they hasten to produce L2 speech before they have developed accurate L2 perception. Their own productions then become the primary input to their new L2 system (Schmidt & Frota, 1986). Interestingly, younger L2 learners often go through a self-imposed silent period when first immersed in an L2 environment (Ervin-Tripp, 1974), lasting for up to six months (Winitz et al., 1995). This parallels the pattern observed in infant L1 phonological development and, at least in experimental conditions, appears to benefit learners’ pronunciation skills (Trofimovich et al., 2009). Trofimovich et al. argue that adult learners’ L2 pronunciation does not benefit from a similar silent period, but they provide no evidence to support this claim. Rather, it seems they view a silent period as an impractical and unnecessary delay to adult learners’ ultimate social and employment goals.
Despite the fact that adult L2 learners attempt to speak before they have much experience with the perception of L2 sounds, there remains a short lag between improvement in perception and corresponding improvement in production. Summarizing the literature, Thomson (2022) reports that L2 perception scores are typically higher than L2 production scores for the same sound categories. Some counter-examples of L2 production accuracy surpassing perceptual accuracy do exist (Goto, 1971). However, most cases come from researchers whose explicit aim is to disprove the claim that perception precedes production (e.g., Borden et al., 1983; Bradlow et al., 1997; Sheldon & Strange, 1982). As Thomson (2022) notes, the reading task used to elicit speech in these studies allows learners to apply explicit knowledge of how to produce words based on spelling. Consequently, it is a strategy for bypassing natural progression, rather than proof that production can otherwise precede perception. There is also no evidence that such a strategy would extend to spontaneous communicative contexts.
In naturalistic contexts, when L2 pronunciation reaches a point where it is adequate for a learner’s communicative purposes, the motivation necessary to continue improving is likely to diminish. This may account for what Derwing and Munro (2015) have called the “Window of Maximal Opportunity” (WMO). While insufficient longitudinal data exists from which to accurately identify a precise point after which improvement in L2 pronunciation plateaus, evidence suggests that the WMO for obvious improvement in adult L2 learners’ pronunciation closes between six months and two years after their arrival in the L2 environment. The speed with which the WMO closes depends on individual learner characteristics (Piske et al., 2001). For example, not all learners struggle with the same L2 segments or suprasegmentals, even if they speak the same L1 (Derwing et al., 2012; Munro et al., 2015). Further, differences in phonetic aptitude predict ultimate attainment (Derwing & Munro, 2015; O’Brien et al., 2007). Some learners may be more fortunate than others in their access to input from the target speech community (Derwing et al., 2008), or they may have a stronger motivation to succeed (Baker Smemoe & Haslam, 2013; Moyer, 2014). Despite these limitations, there can be measurable improvement in individual sound categories beyond the WMO (Derwing & Munro, 2015). The effect of such incremental changes on L2 learners’ global foreign accent is limited, however. Derwing and Munro (2013) found no further improvement in English accent ratings for Slavic and Mandarin immigrants between two and seven years after their arrival in Canada. The Slavic group did evidence improvement in comprehensibility, however, which could be related to changes in pronunciation of individual sounds or suprasegmentals.
Instructed L2 Pronunciation Development
While limits on ultimate attainment in naturalistic learning contexts are well established, instructed L2 pronunciation can provide an opportunity to re-orient learners’ selective perception towards phonetic cues that they have learned to ignore. Despite this, production-oriented approaches to teaching pronunciation have long dominated the field (e.g., Celce-Murcia et al., 2010; Lyster et al., 2013; Thomson & Derwing, 2015). In the 1970s, Audiolingualism used spoken models, but only to introduce production rehearsal activities (Brown & Lee, 2015). In the 1980s, Communicative Language Teaching (CLT) largely neglected pronunciation instruction, focusing almost exclusively on communicative processes to the exclusion of form. While CLT appealed to adult L2 learners’ desire to communicate as quickly as possible, it largely recreated the naturalistic conditions under which L2 pronunciation is most resistant to change. In more recent years, while pronunciation instruction has regained a position of importance, instruction continues to be predominantly production-focused (e.g., Saito & Lyster, 2012).
Since we know that speech perception plays such a foundational role in accurate speech production, why is this not reflected in popular L2 pronunciation teaching methods? One reason for this disconnect is that L2 speech perception research is largely inaccessible to language teachers (Thomson, 2018a). It is typically published in technical journals and largely reports on laboratory-based studies, which may appear to lack relevance to the real world. Unfortunately, by ignoring this important research, pronunciation specialists are unprepared to teach in a way that addresses the cognitive-developmental features of pronunciation associated with speech perception (Derwing & Munro, 2015).
In laboratory studies, changes in L2 speech perception happen consistently and rapidly (Sakai & Moorman, 2018). Furthermore, perceptual training typically leads to improvement in production, albeit more slowly than in perception. The perceptual training technique with the most empirical support is High Variability Pronunciation Training (HVPT) (see Thomson, 2018a). This technique is based on evidence that to be effective, perceptual training needs to include variation in terms of the number of talkers whose voices comprise training stimuli and in the number of phonetic contexts (or words) in which target sounds are presented. Training on a single talker does not generalize to perception of the target sounds spoken by new talkers. Presentation of target sounds in multiple phonetic contexts is important because learning to perceive a sound in one context (e.g., the vowels in “hit” and “heat”) rarely transfers to perception of the same sound in different contexts (e.g., to the vowels in “sit” and “seat”) (Thomson, 2016, 2018a). There is also evidence that training target sounds using nonsense syllables/words is initially more effective than training using real words (Thomson & Derwing, 2016). This may be due to the ability to orient learners’ attention to sounds in nonsense words, without competition from meaning (Guion & Pederson, 2007). HVPT typically presents training stimuli by computer or mobile application. Learners hear training tokens and must respond by clicking/tapping on a symbol, letter, or word representing the sound that they just perceived. They then receive feedback on the accuracy of their responses. There is limited direct evidence for how to use HVPT to train suprasegmental features, but there is some indication that it would have a similarly beneficial effect (Thomson, 2018a).
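The trial procedure described above can be illustrated with a minimal Python sketch. It is not taken from any published HVPT program: the names `Token`, `build_stimuli`, `run_session`, and `accuracy_by_category` are hypothetical, audio playback is omitted, and the learner's click or tap is replaced by a `respond` function supplied by the caller. The sketch only shows the two sources of variability (talkers and phonetic contexts) being crossed with the target categories, followed by forced-choice identification with immediate feedback.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Token:
    talker: str    # hypothetical ID of the person who recorded the stimulus
    context: str   # phonetic context (carrier word or syllable frame)
    category: str  # target sound the learner must identify


def build_stimuli(talkers, contexts, categories):
    """Cross every talker with every phonetic context and target category."""
    return [Token(t, c, k) for t, c, k in product(talkers, contexts, categories)]


def run_session(stimuli, respond, rng=None):
    """Present each token once in random order and record the outcome.

    `respond` stands in for the learner's click/tap: it receives a Token
    and returns the category label the learner chose.
    """
    rng = rng or random.Random()
    order = list(stimuli)
    rng.shuffle(order)
    results = []
    for token in order:
        choice = respond(token)
        correct = choice == token.category
        # Immediate feedback on the accuracy of the response.
        feedback = "Correct" if correct else f"Incorrect: that was {token.category}"
        results.append((token, choice, correct, feedback))
    return results


def accuracy_by_category(results):
    """Return the proportion of correct responses for each target category."""
    totals, hits = {}, {}
    for token, _choice, correct, _feedback in results:
        totals[token.category] = totals.get(token.category, 0) + 1
        hits[token.category] = hits.get(token.category, 0) + int(correct)
    return {k: hits[k] / totals[k] for k in totals}
```

As a usage example, `build_stimuli(["T1", "T2", "T3"], ["h_t", "s_t"], ["i", "I"])` yields 12 tokens (3 talkers × 2 contexts × 2 vowel categories). Passing these to `run_session` with a simulated learner who always answers `"i"` gives a per-category accuracy of 1.0 for `"i"` and 0.0 for `"I"`. The talker labels, context frames, and category symbols here are placeholders, not stimuli from any cited study.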
The cognitive mechanisms underlying the benefits of HVPT over low variability training are not fully understood. One possibility is that all cognitive categories, by their very nature, contain variation. This means that learning a new category necessitates learning about the distribution of sounds that can occur within that category. Another possibility is that the use of multiple talkers maximizes the potential for a given L2 learner to encounter at least some tokens of target sounds that do not automatically assimilate to pre-existing L1 categories. There is evidence that L2 learners are more apt to recognize English vowels as belonging to a new category if those vowels were produced by a talker whose productions are acoustically distant from any confusable L1 vowel categories (Thomson, 2007). It remains unclear whether there is an optimal amount of variability. Programs claiming to be HVPT have used between 2 and 30 talkers, for example, and vary widely in the number of phonetic contexts utilized (Thomson, 2018a). It seems unlikely that using two talkers provides optimal variability, but using increasingly larger numbers may result in diminishing returns or make learning more difficult (Thomson, 2018a).