Agentic Synthesis of Spontaneous Speech: Modeling Entrainment and Cross-Dialect Friction for Robust Conversational ASR

Keywords: status-driven communication bias, cross-modal recognition, rank stability, agentic data augmentation, conversational speech complexity

Current automatic speech recognition (ASR) and natural language understanding systems excel on scripted, single-speaker audio but degrade severely in real-world, spontaneous conversation. This performance gap stems from the chaotic nature of human dialogue: overlapping speech, cognitive disfluencies, and complex acoustic adaptations driven by social dynamics and dialectal differences. Because collecting and annotating authentic, highly disfluent multi-party audio is prohibitively expensive and raises serious privacy concerns, there is a critical need for synthetic data generation pipelines that accurately model the messy, relational, and psychoacoustic realities of human interaction.

Approach

We propose a multi-agent synthetic data generation framework that simulates spontaneous, multi-turn spoken dialogue by explicitly modeling interpersonal entrainment and dialectal perception. Building on constraint-conditioned multi-agent synthesis frameworks like [CulturePark](/paper/art_1185cf8015f249d0bfe242036881b0c9) and [GNOME](/paper/art_dc8ee4f93b254fabbd774f2629272ad8), our pipeline assigns distinct dialectal personas and social statuses (e.g., counselor vs. client) to LLM agents. As the agents converse, a secondary acoustic-cognitive module injects speech disfluencies (repairs, hesitations) and prosodic markup (pitch, speaking rate) based on the cognitive load of cross-dialect communication and the social rules of [Acoustic-Prosodic Entrainment](/paper/art_ff1fe6f6685546dbb978fc1a4caf0ce0). We explicitly control for verbosity to ensure stylistic matching is not a byproduct of message length, as demonstrated in [Understanding Confounding Effects in Linguistic Coordination](/paper/art_56ad845e9ea64098b344b7217f27cb75), before rendering the transcripts into audio via multi-speaker TTS.
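To make the pipeline concrete, the minimal sketch below outlines one possible turn loop in Python. It is an illustration under stated assumptions, not the final implementation: `generate_turn` is a placeholder for an LLM call conditioned on persona and dialogue history, and the cognitive-load proxy, disfluency-injection probabilities, and status-based entrainment rule are toy values standing in for learned or empirically fitted parameters.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    role: str      # e.g., "counselor" or "client"
    dialect: str   # dialectal persona assigned to the agent
    status: str    # social status used to bias entrainment direction


def generate_turn(persona: Persona, history: list[str]) -> str:
    """Placeholder for an LLM call conditioned on persona and dialogue history."""
    return f"({persona.role}, {persona.dialect}) response to: {history[-1] if history else 'opening'}"


def cognitive_load(speaker: Persona, listener: Persona) -> float:
    """Toy proxy for cross-dialect perceptual friction: higher when dialects differ."""
    return 0.35 if speaker.dialect != listener.dialect else 0.1


def inject_disfluencies(text: str, load: float, rng: random.Random) -> str:
    """Insert hesitations and repetition-style repairs, scaled by cognitive load."""
    out = []
    for word in text.split():
        if rng.random() < load * 0.3:
            out.append(rng.choice(["uh", "um", "I mean,"]))  # filled pause
        if rng.random() < load * 0.15:
            out.append(word)                                  # repetition repair
        out.append(word)
    return " ".join(out)


def prosody_markup(text: str, rate: float) -> str:
    """Wrap the turn in SSML-style prosody tags reflecting the entrained speaking rate."""
    return f'<prosody rate="{rate:.2f}">{text}</prosody>'


def simulate_dialogue(a: Persona, b: Persona, n_turns: int = 6, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    history, turns = [], []
    rate_a, rate_b = 1.0, 1.15  # initial speaking rates; the two agents entrain over turns
    for t in range(n_turns):
        speaker, listener = (a, b) if t % 2 == 0 else (b, a)
        raw = generate_turn(speaker, history)
        noisy = inject_disfluencies(raw, cognitive_load(speaker, listener), rng)
        # Simple entrainment rule: the lower-status speaker converges faster.
        alpha = 0.3 if speaker.status == "low" else 0.1
        if speaker is a:
            rate_a += alpha * (rate_b - rate_a)
            turns.append(prosody_markup(noisy, rate_a))
        else:
            rate_b += alpha * (rate_a - rate_b)
            turns.append(prosody_markup(noisy, rate_b))
        history.append(raw)  # keep the clean turn in the LLM context
    return turns


if __name__ == "__main__":
    counselor = Persona(role="counselor", dialect="Castilian Spanish", status="high")
    client = Persona(role="client", dialect="Caribbean Spanish", status="low")
    for turn in simulate_dialogue(counselor, client):
        print(turn)  # markup-annotated transcripts, ready for multi-speaker TTS rendering
```

In the full framework, the rendered SSML-style markup and disfluent transcripts would be passed to a multi-speaker TTS engine, with the verbosity control applied before disfluency injection so that length differences do not masquerade as entrainment.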

Experimental Plan

We will evaluate our synthetic dataset by training a joint ASR and constituency parsing model, similar to the architecture in [Neural Constituency Parsing of Speech Transcripts](/paper/art_f1124681bb484221b45b29a9de118402), to simultaneously transcribe speech and detect disfluencies. Our primary hypothesis is that models fine-tuned on our entrainment- and dialect-aware synthetic data will significantly outperform baselines on real-world spontaneous speech benchmarks. We will evaluate Word Error Rate (WER) and disfluency detection F1 on the [Analysis of Disfluency in Children's Speech](/paper/art_92c1c9dbe3c449ad8b0ac6ff3c7f6522) dataset and the Spanish counseling dataset [MIDAS](/paper/art_7415ad072aef4f40a93fca907f88014c). Baselines will include standard Wav2Vec 2.0 and Whisper models fine-tuned only on read speech or standard single-pass synthetic dialogues, alongside an ablation study to measure the isolated impact of modeling cross-dialect perceptual friction.
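A minimal evaluation harness for these two metrics might look like the sketch below, using `jiwer` for WER and scikit-learn for token-level disfluency F1. The toy example, and the assumption that predicted disfluency labels are already aligned to reference tokens, are illustrative placeholders rather than the actual benchmark splits or alignment procedure.

```python
import jiwer
from sklearn.metrics import f1_score


def evaluate(references: list[str], hypotheses: list[str],
             ref_disfl: list[list[int]], hyp_disfl: list[list[int]]) -> dict:
    """Compute corpus-level WER and token-level disfluency-detection F1.

    ref_disfl / hyp_disfl hold one binary label per reference token
    (1 = token is part of a disfluency such as a filled pause or repair).
    """
    wer = jiwer.wer(references, hypotheses)
    y_true = [lab for sent in ref_disfl for lab in sent]
    y_pred = [lab for sent in hyp_disfl for lab in sent]
    return {"wer": wer, "disfluency_f1": f1_score(y_true, y_pred)}


# Toy usage with hand-labeled examples (real runs would use the held-out test sets).
refs = ["i i want to uh reschedule the appointment"]
hyps = ["i want to reschedule the appointment"]
ref_labels = [[1, 0, 0, 0, 1, 0, 0, 0]]  # "i" repeat and "uh" marked as disfluent
hyp_labels = [[1, 0, 0, 0, 1, 0, 0, 0]]  # model predictions aligned to reference tokens
print(evaluate(refs, hyps, ref_labels, hyp_labels))
```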

Open Questions

How can multi-agent simulations be parameterized to generate synthetic conversational data that accurately reflects the spontaneous disfluencies and acoustic entrainment patterns of real human interactions?
Does training speech recognition models on synthetic dialogues with explicitly modeled cross-dialect perceptual friction improve their robustness to real-world conversational disfluencies?
Can the dynamic injection of linguistic coordination and hesitation markers into synthetic training data bridge the performance gap between read speech and spontaneous therapeutic dialogue?