This research compared L1-L2 English and L1-L2 Portuguese to assess how much a foreign accent accounts for both rhythm metrics and prosodic-acoustic parameters, as well as for the choice of the target voice in a voice lineup.
This research aims to examine both the prosodic-acoustic features and the perceptual correlates of foreign-accented English and foreign-accented Brazilian Portuguese, and to check how the speakers' productions of foreign and native accents correlate with the listeners' perception. In the methodology, we conducted a speech production procedure with a group of American speakers of L2 Brazilian Portuguese and a group of Brazilian speakers of L2 English, and a speech perception procedure in which we performed voice lineups for both languages. For the speech production statistical analysis, we ran Generalized Additive Models to evaluate the effect of the language groups on each class of features (metric or prosodic-acoustic), controlled for the smoothing effect of the covariate(s) of the opposite class. For the speech perception statistical analysis, we ran a Kruskal-Wallis test and a post-hoc Dunn's test to evaluate the effect of the lineup voices on the scores judged by the listeners. We additionally conducted acoustic (voice) similarity tests based on cosine and Euclidean distances. Results showed significant acoustic differences between the language groups in terms of variability of f0, duration, and voice quality. For the lineups, the results indicated that prosodic features of f0, intensity, and voice quality correlated with the listeners' perceived judgments.
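The acoustic similarity tests mentioned above can be sketched in plain Python. The feature vectors below (mean f0, f0 SD, speech rate, shimmer) are hypothetical, chosen only to illustrate how the two distance measures compare a target voice against a foil:

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for vectors pointing the same way, up to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical prosodic-acoustic vectors: [mean f0 (Hz), f0 SD (Hz),
# speech rate (syll/s), shimmer (%)] for a target voice and one foil.
target = [210.0, 28.0, 5.1, 4.2]
foil = [205.0, 31.0, 4.8, 3.9]

print(cosine_distance(target, foil))   # near 0: acoustically similar voices
print(euclidean_distance(target, foil))
```

In practice, the features should be normalized (e.g., z-scored across all lineup voices) before computing Euclidean distance, so that Hz-scaled features do not dominate the percentage-scaled ones.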
An accent is a salient and dynamic aspect of communication and fluency, both in the native language (L1) and in a foreign language (L2)1. A foreign accent reflects the L2 phonetic features of a target language, and it can change over time in response to the speaker's L2 experience, speaking style, input quality, exposure, and other variables. A foreign accent can be quantified as a (scalar) degree of difference between L2 speech produced by a foreign speaker and a local or reference accent of the target language2,3,4,5.
This research aims to examine both the prosodic-acoustic features and the perceptual correlates of foreign-accented English and foreign-accented Brazilian Portuguese (BP), as well as to check to what extent the speakers' productions of foreign and native accents correlate with the listeners' perception. Prior research in the forensic field has demonstrated the robustness of vowels and consonants in foreign accent identification, whether as features stable across the long-term analysis of an individual (the Lo case6) or as contributors to high speaker-identification accuracy (the Lindbergh case7). However, the exploration of prosodic-acoustic features based on duration, fundamental frequency (f0, i.e., the acoustic correlate of pitch), intensity, and voice quality (VQ) has gained increasing attention8,9. Thus, the choice of prosodic-acoustic features in this study represents a promising avenue in the forensic phonetics field8,10,11,12,13.
The present research is supported by studies dedicated to foreign accents as a form of voice disguise in forensic cases14,15, as well as to the preparation of voice lineups for speaker recognition16,17. For instance, speech rate played an important role in the identification of German, Italian, Japanese, and Brazilian speakers of English18,19,20,21,22. Besides speech rate, long-term spectral and f0 features challenge even proficient L2 speakers who are familiar with the target language, because deficits in memory, attention, and emotion are reflected in the speaker's phonetic performance during long speech turns23. The prevailing view in the forensic field is that what really sounds like a foreign accent depends largely on the listener's unfamiliarity with it, rather than on a none-to-extreme degree of foreign-accented speech24.
In the perceptual domain, a common forensic tool used since the mid-1990s for recognizing criminals is the voice lineup (auditory recognition), analogous to the visual lineup used to identify a perpetrator at a crime scene16,25. In a voice lineup, the suspect's voice is presented alongside foils (voices similar in sociolinguistic aspects such as age, sex, geographical location, dialect, and cultural status) for identification by an earwitness. The success or failure of a voice lineup depends on the number of voice samples and their durations25,26. Furthermore, for real-world samples, audio quality consistently impacts the accuracy of voice recognition: poor audio quality can distort the unique characteristics of a voice27. In the case of voice similarity, fine phonetic detail based on f0 can confuse the listener during voice recognition28,29. Such acoustic features extend beyond f0 and encompass duration as well as spectral features of intensity and VQ30. This view of multiple prosodic features is crucial in the context of forensic voice comparison to ensure accurate speaker identification9,14,15,29,31.
In summary, studies in forensic phonetics have shown some variation regarding foreign accent identification over the last decades. On the one hand, a foreign accent does not seem to affect the process of identifying a speaker32,33 (especially if the speaker is unfamiliar with the target foreign accent34). On the other hand, there are findings in the opposite direction12,34,35.
This work received approval from a human research ethics committee. Furthermore, informed consent was obtained from all participants involved in this study to use and publish their data.
1. Speech production
NOTE: We collected speech from a reading task in both 'L1 English-L2 BP', produced by Group 1, the American English (from the U.S.A.) Speakers (AmE-S), and 'L1 BP-L2 English', produced by Group 2, the Brazilian Speakers (Bra-S). See Figure 1 for a flowchart for speech production.
Figure 1: Schematic flowchart for speech production.
Figure 2: Screenshot of phonetic alignment using the MAUS forced aligner. (A) The dashed rectangle accepts 'my_file.wav'/'my_file.txt' files by drag-and-drop, or can be clicked to browse for them in a folder; the upload button is indicated by the red arrow. (B) The uploaded files from panel A (see blue arrow), the pipeline to be used, the language of the file pairs, the file format to be returned, and a 'true/false' button for keeping all files. (C) The 'Terms of Usage' checkbox (see green arrow), the Run Web Services button, and the Results (TextGrid files to be downloaded).
Figure 3: Screenshot of the realignment procedure. (A) Input settings form for the realignment procedure. (B) Partial waveform and broadband spectrogram with f0 contour (blue), and six tiers segmented (and labeled) as tier 1: units from one vowel onset to the next vowel onset (V_to_V); tier 2: units of vowel (V), consonant (C), and pause (#); tier 3: phonic representations of the V_to_V units; tier 4: some words from the text; tier 5: some chunks (CH) of speech from the text; tier 6: tonal tier containing the highest (H) and lowest (L) tone of each speech chunk, produced by a female AmE-S. (C) Input settings for the automatic extraction of the acoustic features.
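In this protocol, the acoustic features are extracted by the 'SpeechRhythmExtractor' Praat script. As a rough illustration of the duration-based measures only, a minimal Python sketch over a segmented tier (labels 'V', 'C', and '#' as in tier 2 above; syllable count approximated by the number of vowel nuclei, which is an assumption of this sketch, not necessarily the script's method):

```python
def duration_rates(intervals):
    """Compute duration-based rates from (label, duration_s) pairs, where
    '#' marks a silent pause and 'V'/'C' mark vowel/consonant intervals.
    Syllable count is approximated by the number of vowel nuclei."""
    total = sum(dur for _, dur in intervals)
    pause = sum(dur for lab, dur in intervals if lab == "#")
    phonation = total - pause
    n_syll = sum(1 for lab, _ in intervals if lab == "V")
    return {
        "speech_rate": n_syll / total,             # syll/s, pauses included
        "articulation_rate": n_syll / phonation,   # syll/s, pauses excluded
        "pause_rate": pause / total,               # proportion of time pausing
    }

# Toy utterance: two CV syllables followed by a 200 ms pause
tier2 = [("C", 0.08), ("V", 0.12), ("C", 0.07), ("V", 0.13), ("#", 0.20)]
print(duration_rates(tier2))
```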
2. Speech perception
NOTE: We carried out four voice lineups in English with American listeners and four lineups in BP with Brazilian listeners. See Figure 4 for a flowchart for speech perception.
Figure 4: Schematic flowchart for speech perception.
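The perception analysis compares listeners' scores across the lineup voices with a Kruskal-Wallis test (which the protocol runs in R). As a hedged, pure-Python sketch of the underlying H statistic (with tie correction), assuming one list of numeric similarity scores per voice:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (tie-corrected) for k independent groups;
    compare against a chi-square distribution with k-1 degrees of freedom."""
    data = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n = len(data)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:  # assign mid-ranks to tied values
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        mid_rank = (i + j + 1) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = [0.0] * len(groups)
    for (x, gi), r in zip(data, ranks):
        rank_sums[gi] += r
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
    return h / (1.0 - tie_term / (n ** 3 - n))  # correct for ties

# Hypothetical listener scores for a target voice and two foils
print(kruskal_wallis_h([8, 9, 7, 8], [3, 4, 2, 5], [4, 3, 5, 2]))
```

A significant H would then motivate the post-hoc Dunn's test the protocol uses to locate which voices differ.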
Figure 5: Directory setup for speech perception. Lineup folders. Each folder contains six L1 voices, the L2 target voice, the "CreateLineup" script, and the voice lineup audio file (returned after running the script).
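The lineup assembly step can be sketched in plain Python with the standard-library wave module. This is only an analogue of the authors' Praat "CreateLineup" script, whose actual behavior (gap length, ordering, normalization) may differ:

```python
import random
import wave

def create_lineup(voice_paths, out_path, gap_s=1.0, seed=None):
    """Concatenate voice samples into one lineup audio file, separated by
    gap_s seconds of silence, in randomized presentation order.
    Returns the order used, so listener responses can be scored later."""
    order = list(voice_paths)
    random.Random(seed).shuffle(order)
    with wave.open(order[0], "rb") as first:
        params = first.getparams()
    gap_frames = int(gap_s * params.framerate)
    silence = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(order):
            with wave.open(path, "rb") as voice:
                # all samples must share channel count, width, and rate
                assert voice.getparams()[:3] == params[:3], "formats must match"
                out.writeframes(voice.readframes(voice.getnframes()))
            if i < len(order) - 1:
                out.writeframes(silence)
    return order
```

Fixing the seed makes the presentation order reproducible across listeners, if the design calls for that.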
Results for speech production
In this section, we describe the performance of the statistically significant prosodic-acoustic features and rhythm metrics. The prosodic features were speech, articulation, and pause rates, which relate to duration, and shimmer, which relates to voice quality. The rhythm metrics were the standard deviation (SD) of syllable duration, the SD of vocalic or consonantal duration, and the variation coefficient of syllable duration (see the Supplem...
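As an illustration of the syllable-based metrics (which the protocol computes with the Praat extraction script), a minimal Python sketch over hypothetical syllable durations; the naming "VarcoS" below is one common label for the rate-normalized variation coefficient, not necessarily the authors' term:

```python
import statistics

def syllable_rhythm_metrics(durations_ms):
    """Return the SD of syllable duration and its variation coefficient
    (VarcoS = 100 * SD / mean), which normalizes out overall speech rate."""
    sd = statistics.stdev(durations_ms)
    varco = 100.0 * sd / statistics.mean(durations_ms)
    return sd, varco

# Hypothetical syllable durations (ms) from one speaker's read passage
sd, varco = syllable_rhythm_metrics([180, 220, 150, 260, 190])
print(round(sd, 1), round(varco, 1))
```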
The current protocol presents a novelty in the field of (forensic) phonetics. It is divided into two phases: one based on production (acoustic analysis) and one based on perception (judgment analysis). The production analysis phase comprises data preparation and forced alignment, realignment, and automatic extraction of prosodic-acoustic features, besides the statistics. This protocol connects the stage of data collection to the data analysis in a faster and more efficient way than the traditional protocols based on ...
The authors have no conflicts of interest to declare.
This study was supported by the National Council for Scientific and Technological Development - CNPq, grant no. 307010/2022-8 for the first author, and grant no. 302194/2019-3 for the second author. The authors would like to express their sincere gratitude to the participants of this research for their generous cooperation and invaluable contributions.
Name | Company | Catalog Number | Comments |
CreateLineup | Personal collection | # | Praat script for voice lineup preparation |
Dell I3 (with solid-state drive - SSD) | Dell | # | Laptop computer |
Praat | Paul Boersma & David Weenink | # | Software for phonetic analysis |
Python 3 | Python Software Foundation | # | Interpreted, high-level, general-purpose programming language |
R | The R Project for Statistical Computing | # | Programming language for statistical computing |
Shure Beta SM7B | Shure | # | Microphone |
SpeechRhythmExtractor | Personal collection | # | Praat script for automatic extraction of acoustic features |
SurveyMonkey | SurveyMonkey Inc. | # | Suite of free customizable surveys, plus back-end programs for data analysis, sample selection, debiasing, and data representation. |
Tascam DR-100 MKII | Tascam | # | Digital voice recorder |
The Munich Automatic Segmentation System MAUS | University of Munich | # | Forced-aligner of audio (.wav) and linguistic information (.txt) files |
VVUnitAligner | Personal collection | # | Praat script for automatic realignment and post-processing of phonetic units |
Copyright © 2025 MyJoVE Corporation. All rights reserved