SProSIG Lecture Series

The Speech Prosody SIG Lecture Series, an initiative of the Speech Prosody SIG Officers, aims to (1) offer the Speech Prosody community a broad view of themes and methods in speech prosody; (2) introduce new perspectives and foster debate; and (3) stimulate collaborations among speech prosody researchers, including by making the community aware of public repositories of data and corpora, joint projects seeking collaborators, and other resources that can be freely shared. Lectures will be presented live on YouTube, with Q&A handled through YouTube's chat feature.

Upcoming Talks

Xiaoming Jiang, September 4th, 1pm Campinas time
Peggy Mok, October 21st, 8am Campinas time
Sacha Calhoun, November 17th, 5pm Campinas time

Archived Lectures

Tackling prosodic phenomena at their roots

Professor Yi Xu, University College London. October 25, 2023.

Abstract: Rather than being a coherent whole, speech prosody consists of highly diverse phenomena that are best understood in terms of their communicative functions, together with specific mechanisms of articulatory encoding and perceptual decoding. The understanding of these root causes is therefore key to further advances in prosody research.

archived talk at YouTube and at bilibili

Segmental Articulations and Prosody

Malin Svensson Lundmark, Lund University, November 23rd, 2023

Abstract: This lecture will be on an aspect of the articulatory-acoustics relationship that is rarely addressed but which is both stable and robust across, e.g., places of articulation, tonal context and prosodic levels. It’s about the acceleration and deceleration of articulatory movements and how they coincide with acoustic segment boundaries.

archived talk at YouTube and at bilibili

How to handle variability in the study of intonation

Amalia Arvaniti, Radboud University, Netherlands. December 14, 2023.

Abstract: This talk will give an overview of the issue of variability in intonation and present methodological approaches that render variability easier to handle. These methodologies are presented by means of a case study, the English pitch accents H* and L+H*, which are treated as distinct phonological entities in some accounts but as endpoints of a continuum in others. The research that will be presented sheds light on the reasons for the disagreement between analyses and the discrepancies between analyses and empirical evidence, by examining both production data from British English unscripted speech and perceptual data, which also link the processing of the two accents to the participants’ levels of empathy, musicality, and autistic-like traits.

archived talk at YouTube and at bilibili

The speech synthesis phoneticians need is both realistic and controllable: A survey and a roadmap towards modern synthesis tools for phonetics.

Zofia Malisz, KTH Royal Institute of Technology.

April 17th, 2024. archived talk, also at bilibili.

Abstract: In the last decade, data-driven and machine-learning-driven approaches to speech synthesis have greatly improved its quality, so much so that the realism achievable by current neural synthesisers can rival natural speech. However, modern neural synthesis methods have not yet been adopted as tools for experimentation in the speech and language sciences. This is because modern systems still lack the ability to manipulate low-level acoustic characteristics of the signal, such as formant frequencies.
In this talk, I survey recent advances in speech synthesis and discuss their potential as experimental tools for phonetic research. I argue that speech scientists and speech engineers would benefit from working more with each other again, in particular in the pursuit of prosodic and acoustic parameter control in neural speech synthesis. I showcase several approaches to fine synthesis control that I have implemented with colleagues: the WavebenderGAN and a system that mimics the source-filter model of speech production. These systems make it possible to manipulate formant frequencies and other acoustic parameters with the same (or better) accuracy as, e.g., Praat, but with far superior signal quality.
Finally, I discuss ways to improve synthesis evaluation paradigms, so that not only industry but also speech science experimentation benchmarks are met. My hope is to inspire more students and researchers to take up these research challenges and explore the potential of working at the intersection of speech technology and speech science.

Outline: 1. I briefly discuss the history of advancements in speech synthesis, starting in the formant synthesis era, and explain where the improvements came from. 2. I show experiments I have done in which humans process modern synthesis no differently from natural speech in a lexical decision task, as evidence that the realism (“naturalness”) goal has been largely achieved. 3. I explain how realism came at the expense of controllability. I show that controllability is an indispensable feature for speech synthesis to be adopted in phonetic experimentation. I survey the current state of research on controllability in speech engineering, concentrating on prosodic and formant control. 4. I propose how we can fix this by explaining the work I have done with colleagues on several systems that feature both realism and control. 5. I sketch a roadmap to improve synthesis tools for phonetics by placing focus on benchmarking systems according to scientific criteria.

Predictive Modelling of Turn-taking in Spoken Conversation

Gabriel Skantze, KTH

May 15, 2024. Archived talk at YouTube, and at Bilibili

Abstract: Conversational interfaces, in the form of voice assistants, smart speakers, and social robots are becoming ubiquitous. This development is partly fuelled by the recent developments in large language models. While this progress is very exciting, human-machine conversation is currently limited in many ways. In this talk, I will specifically address the modelling of conversational turn-taking. As current systems lack the sophisticated coordination mechanisms found in human-human interaction, they are often plagued by interruptions or sluggish responses. I will present our recent work on predictive modelling of turn-taking, which allows the system to not only react to turn-taking cues, but also predict upcoming turn-taking events and produce relevant cues to facilitate real-time coordination of spoken interaction. Through analysis of the model, we also learn about which cues are relevant to turn-taking, including prosody and filled pauses.

Plan:

Introduction to conversational systems and human-robot interaction
Why turn-taking is problematic in current systems
Voice Activity Projection: A predictive, data-driven model of turn-taking
Analysis of the model (prosody and filled pauses)
Towards better turn-taking in conversational systems

Syntagmatic Prominence Relations in Prosodic Focus Marking

Lecturer: Simon Roessig (University of York, UK)

Sept 24th, 2024. Archived talk at YouTube, and at Bilibili.

Abstract: This talk is about the role of prenuclear prominences and their relation to nuclear accents in German and English. The production results (German) that I will present show that the realization of the prenuclear domain depends on whether it is focal or prefocal. The prenuclear noun is characterized by larger F0 excursions, higher F0 maxima, and longer durations when it is in broad focus than when it precedes a narrow focus. Furthermore, the realization of the prenuclear domain depends on the following focus type: The prenuclear noun is produced with smaller F0 excursions, lower F0 maxima and shorter durations before a corrective focus than before a non-corrective narrow focus. The findings suggest that the phonetic manifestation of information structure is distributed over larger prosodic domains with an inverse relationship in the syntagmatic dimension. In addition, the study contributes further evidence that continuous phonetic detail is used to encode information structural categories. An important question that arises from the production data is whether this phonetic detail can be used by listeners in perception. I will present first results from a series of perception experiments (German and English) to investigate this question.

Plan: 1. I will begin by outlining what we know about focus prosody in the nuclear and prenuclear domains. 2. I will then present findings from a production study that examines the prosody of the prenuclear domain in different types of focus. 3. These results show that there are interesting strength relations between prenuclear and nuclear prosody in the encoding of focus types. 4. I will present preliminary findings from perception experiments investigating whether listeners use prenuclear prominence modulations in identifying focus types. 5. Finally, I will conclude with a discussion of the results and future directions.

On the intermittency of speech and the lack of compelling evidence for phrasal rhythm or hierarchical prosodic phrase structure

Sam Tilsen, Department of Linguistics, Cornell University

Oct. 23rd, 2024

archived at YouTube and at Bilibili

Abstract: On the timescale of prosodic phrases, temporal patterns in conversational speech are not very regular. To the contrary, spurts of fluent speech tend to be highly irregular and intermittent. Hesitations and pauses are common. What is the mechanism behind this pattern? In this talk I consider and reject two possible explanations. First, I examine the possibility that a hierarchical organization of relatively long-timescale prosodic units might explain intermittency. Several predictions of hierarchical prosodic structure accounts are examined in an analysis of the Switchboard NXT corpus, but the empirical patterns are not very consistent with those predictions. Moreover, I argue that even laboratory studies that purport to find evidence for hierarchical phrase structure suffer from flawed argumentation. For these reasons, a hierarchical structure-based account of intermittency is suspect. Second, I examine the possibility that there is a phrase-timescale oscillator that governs phrase initiation. I critique recent studies that have argued that such an oscillator is involved in speech production. Through model simulations I show that in order to adequately capture empirical timing patterns, such an oscillator would need an overly powerful ability to change frequency from cycle to cycle. On top of this, the neurophysiological basis for a role of oscillation in governing phrasal timing is called into question. Instead of structural or oscillation-based mechanisms being responsible for phrasal timing, I argue that intermittency arises due to mechanisms responsible for the organization of syntactic and conceptual systems. I present a model in which phrase initiation is contingent on the achievement of a coherent state among those systems, and show how stochastic influences can generate hesitative phenomena that may be the basis for the intermittency of speech.

Plan:

Speech activity is intermittent on the timescale of prosodic phrases.
Hierarchical prosodic phrase structure does not explain intermittency.
An oscillatory production mechanism does not explain intermittency.
A model that accounts for intermittency is presented, in which syntactic and conceptual systems must achieve a coherent state before production is initiated.
Implications of the model are discussed.

Controlling and Probing Generative End-to-end Models: New Opportunities for Research on Prosody

Gerard Bailly, CNRS, Grenoble-Alps University, November 27, 2024.

Archived talk: at YouTube, and at bilibili

Abstract: For decades, the interplay of phonetic forms, phonological structures and communicative functions was mainly investigated via meticulous analysis of the acoustic or multimodal performance of speakers and listeners... with the ambition of providing technology with principles, controls and constraints emerging from our human creativity. This golden age of human intelligence is now largely defeated by model-free generative AI, in particular text-to-speech systems that provide signals or videos of speaking faces often mistaken for natural data!

In this presentation, I will argue for a positive attitude: consider these high-quality end-to-end models as a proxy to capture lawful data variability and develop new tools to explore internal representations (so-called latent spaces) built by these successful models. I will further detail two works started with my colleagues Olivier Perrotin and Martin Lenglet: (1) exploration and control of phonetic and phonological embeddings via causal regression (Lenglet et al., Interspeech 2022 & submitted to CSL); (2) exploration and fine control of audiovisual attitudes via verbal tags (Bailly et al., LREC/COLING 2024).

Plan: 1. Brief review of latent space analysis and causal regression 2. Brief review of end-to-end text-to-speech technology 3. Exploration and control of phonetic and phonological embeddings via causal regression 4. Exploration and fine control of audiovisual emotion via verbal tags 5. Take-home message: Generative models as a proxy to data mining

Speech prosody and social meaning

Robert Xu, Harvard University

Wednesday, Dec. 18th, 2024

Viewing Links: at YouTube, at Bilibili

Abstract: This talk explores how third-wave sociolinguistic theories can deepen our understanding of the structure and function of speech prosody, particularly in social interactions. Key concepts such as indexicality, style, stancetaking, and enregisterment will be introduced and exemplified through an ecologically-minded study of prominent social types in Beijing. I will demonstrate how prosodic features—such as pitch variation, voice quality, and timing—serve as semiotic resources for constructing these social types within a socio-cultural landscape. These prosodic elements not only interact with each other but also conspire with the body and conversation structures to convey dialogical social meanings. These meanings enable the constructed social types to mediate broader social relationships, structures, and transformations.

Plan:

The theory of indexicality
Stancetaking, enregisterment, and personhood
Pitch variation as semiotic resources
Voice quality and the body
Timing variation and conversation structure

Early perception of prosody in atypical language acquisition.

Sonia Frota, University of Lisbon

April 24, 2025

Video Link, also at Bilibili

Abstract: Infants’ early sensitivity to prosody has supported the view that prosody might facilitate language acquisition. Recent research has suggested that early prosodic development is crucially shaped by language experience. However, typically and atypically developing infants may vary in their language experience, and it is largely unknown whether and how early prosodic development differs in these two populations. Therefore, the potential of prosody to scaffold language learning in atypical development remains to be determined. I will present findings from speech perception experiments focusing on the perception of stress and intonation patterns during the first year of life, in infants at risk for language impairments and infants with Down Syndrome. The early perception abilities of the atypical groups will be compared to those of their typically developing peers. The results suggest different developmental paths for early perception of stress and intonation across groups, and highlight the importance of prosody also in atypical language acquisition. Moreover, the decrease in early sensitivity to prosody in older atypically developing infants pinpoints a crucial developmental window for early interventions using prosody to support language learning in this population.

5-line plan of the presentation:

Early sensitivity to prosody and early prosodic development
Results from experiments on the perception of stress in atypically developing infants
Results from experiments on the perception of intonation in atypically developing infants
Comparing prosodic abilities in atypical groups and typically developing infants
Discussion and implications for remediation and intervention strategies to support language acquisition

Perception-based phonology is about time

Aviad Albert, University of Cologne.

June 11, 2025, 1 pm (Brasilia time, UTC -3)

Viewing Link at YouTube and at Bilibili

Abstract: Our auditory system evolved to process incoming information at specific timescales, each with distinct effects. At the slower timescale, we temporally integrate isolated events, enabling us to effectively detect relationships such as slow vs. fast, and regular vs. irregular. At the faster timescale, we integrate events at such a rapid pace that they appear continuous, allowing us to discern relationships like high vs. low (frequency) and harmony vs. noise. This reduction of auditory input into continuous (rather than discrete) primitives can significantly contribute to our models of phonology and phonetics, with emphasis on prosody. In this talk, I will demonstrate how timescale-based models of prosody illuminate the universality of syllables and the phonetic basis of sonority. I will present the ProPer toolbox for acoustic analysis, to illustrate how these perspectives can yield a novel system for analyzing pitch contours and strength/weight relations across syllabic intervals of speech signals. Furthermore, I will argue that this framework can shed light on the notion of speech rhythm. I claim that speech rhythm and musical rhythm differ in their ultimate goals, while exploiting the same temporal space. This suggests that rhythm is a timescale, and isochrony is a uniquely musical goal. Consequently, speech rhythm should be treated as a dynamic moving target that could be more adequately represented like pitch contours, in terms of time series trajectories.

5-line plan of the presentation:

Present an audio-visual illustration of the two perceptual regimes (PRiORS).
Show links from general audition to linguistic processing (universal syllable).
Discuss sonority and NAP-based models as alternatives to the SSP.
Present the ProPer toolbox (PROsodic analysis with PERiodic energy).
Discuss speech rhythm given PRiORS, show ProPer implementation.

Autism and (what it can teach us about) prosody

Simon Wehrle, University of Cologne, Germany

September 10, 2025

Viewing Link: https://www.youtube.com/live/vAMy_Y4Fh0Q, and at Bilibili

Abstract:

Prosody has played a key role in accounts of autism since it was first described in the 1940s. In this talk, I will draw on recent analyses of interactions between (German-speaking) autistic adults to 1) illustrate what appear to be typical features of prosody and conversation in autism, and 2) reflect on what these results and the methods used to uncover them can teach us about (research on) prosody.

After introducing definitions and conceptions of autism and summarising a recent review on linguistic prosody in autism, I will present experimental results from two corpora of (semi-)spontaneous interaction. Importantly, and in direct contrast to the vast majority of research on communication in autism, our participants were a) adults, b) engaged in dialogic interaction, c) grouped into matched homogenous dyads (e.g. autistic–autistic), and d) speaking a language other than English.

The key dimensions of communication I will focus on are: 1) intonation style—the melodicity and diversity of pitch contours; 2) turn-taking and backchannels—the organisation of who speaks when and the use of feedback signals; 3) eye gaze—the occurrence of (mutual) eye contact in face-to-face interaction.

I will use broad conceptions of both neurodiversity and prosody to highlight the general importance of individual-specific behaviour, conversational context, and multimodal interaction. Taken together, these perspectives point towards a holistic and inclusive model of communicative interaction centered on the importance of cultural and cognitive diversity.

Plan:

Overview—autism and prosody
Intonation—melodicity, and the importance of individual variability
Conversation—turn-taking, and the importance of context
Multimodality—eye gaze, and the importance of face-to-face interaction
Conclusion—perspectives and horizons

Intonational form-meaning relationship: some insights from French

Cristel PORTES, Univ. Aix-Marseille, France

November 19, 2025.

Video link, also at Bilibili.

Abstract: It is generally claimed that there is no one-to-one mapping between intonation and meaning. In this talk, I will argue that it is nevertheless crucial to study intonational meaning and reflect on how to do it, drawing mainly on data from French intonation. After showing that the "no one-to-one mapping" does not prevent the assumption of “linguistic normalcy” for intonation, I will show the importance of defining intonational meaning in its specificity and how combining corpus analysis with experimentation is useful for doing so. I will then discuss the relevance of the concepts of compositionality and duality of patterning, before addressing the crucial issue of variability in its various aspects. I will conclude by emphasizing the importance of studying meaning in order to understand and explain form, and vice versa.

Plan: 0) About the "no one-to-one mapping" claim; 1) Addressing the specificity of intonational meaning; 2) Combining corpus analysis and laboratory phonology; 3) Compositionality and duality of patterning; 4) Accounting for intonational form-meaning variability; 5) Conclusion: the form-meaning relationship at the heart of linguistic inquiry.

L2 Prosody teaching and learning: Can gesture lend a hand?

Lieke van Maastricht, Radboud University Nijmegen
December 10, 2025

Viewing Link, also at Bilibili.

Abstract: In our multilingual society, communicating in foreign languages (L2) is increasingly important but complicated by a lack of existing methods for L2 prosody training. While L2 learners often practice individual L2 segments in class, they barely receive instruction on the form or function of the prosodic features of their L2 (e.g., the use of pitch accents to mark discourse focus). However, prosody is essential for communication, and its understanding has both theoretical repercussions and societal relevance. In short, L2 researchers, teachers, and learners are starting to recognize the importance of L2 prosody acquisition, but how do we overcome the challenges that it presents, especially in an L2 classroom?

In my talk, I will first discuss what we know (from my own work and that of others) about the L2 acquisition of different prosodic features, regarding the factors that influence L2 development, as well as some effects of prosodic errors on L1-L2 communication. Second, I will present research on L2 prosody production and perception in combination with gestural information. Can perceiving or producing gestures while learning the prosody of an L2 benefit learners?

Five-point plan: 1. L2 prosody acquisition: Is it possible? 2. Which factors influence L2 prosody acquisition? 3. Effects of incorrect prosody on communication? 4. Can producing gestures help learners acquire L2 prosody? 5. Can perceiving gestures help learners acquire L2 prosody?

Lecture Series Organizer: Plinio A. Barbosa, University of Campinas, Brazil

Of related interest: Ward and Levow’s Prosody Tutorial Video Series