Vocal Entrainment in Multi-Party Conversations: An Exploration of Automated and Experimental Approaches
Doctoral dissertation, University of Arizona, Tucson, USA, 2025
This dissertation assesses existing methodologies for measuring vocal entrainment and explores deep neural network architectures for modelling entrainment in spoken conversations. It also evaluates the impact of group size on entrainment models and discusses the challenges in multi-party vocal entrainment modelling.

Experiment 1 utilises an existing benchmark (Nasir et al. 2020) for dyadic vocal entrainment modelling and reports its performance on multi-party spoken conversations. It also presents findings from an annotation task designed to compare dyadic and multi-party conversations. The results show that existing vocal entrainment models are not sensitive to entrainment information in multi-party conversations. Moreover, the annotation task shows that multi-party conversations have more complex and less predictable turn-taking patterns than dyadic conversations, highlighting a key difference between two- and multi-party conversations.

Experiment 2 reports results from modelling dyadic vocal entrainment with an LSTM-based model and shows that this architecture is sensitive to entrainment information in natural conversations. Using local entrainment measures, this architecture can differentiate between conversations in which entrainment is present and contexts in which it is not. The findings have implications for some key applications in entrainment modelling, such as identifying points of entrainment and disentrainment within conversations, analysing entrainment in ongoing conversations, and predicting entrainment in upcoming turns.

In Experiment 3, I report on another implementation of the LSTM model for modelling local entrainment in multi-party conversations. I implement the LSTM model for two main tasks: evaluating the efficacy of transfer learning from two- to multi-party conversations, and evaluating the model's training and test performance on multi-party conversational data.
The results show that the LSTM model is sensitive to the quality of interaction between groups of speakers. They also demonstrate that local entrainment is a viable tool for modelling multi-party conversations. These experiments show that turn-level changes in acoustic features are a robust measure of entrainment and conversational naturalness, and can be successfully used as training data for entrainment models. The LSTM-based models are a good fit for building systems that can predict the voice characteristics of upcoming turns by learning from information in entrained speech. Further, the results show that entrainment information becomes available early in the conversation, making it possible to detect entrainment in ongoing conversations. Local entrainment measures can also be used to identify moments of co-operation and conflict within conversations. These experiments also shed light on the impact of group size and complex turn-taking on vocal entrainment, and offer potential solutions for overcoming the inherent challenges of modelling entrainment in multi-party spoken conversations.
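To make the idea of "turn-level changes in acoustic features" concrete, here is a minimal sketch of one common way to compute a local entrainment score: summarise each turn as a vector of acoustic features and measure how close consecutive turns are. This is an illustrative assumption, not the dissertation's actual pipeline; the function name `local_entrainment` and the toy feature values are hypothetical.

```python
import numpy as np

def local_entrainment(turn_features: np.ndarray) -> np.ndarray:
    """Proximity between consecutive turns (hypothetical local measure).

    turn_features: (n_turns, n_features) array of per-turn acoustic
    summaries (e.g. mean pitch, intensity, speech rate), z-scored per
    feature so distances are comparable across features.
    Returns the negative Euclidean distance between each turn and the
    previous one: higher (closer to 0) means more similar, i.e. more
    locally entrained.
    """
    z = (turn_features - turn_features.mean(axis=0)) / turn_features.std(axis=0)
    diffs = np.diff(z, axis=0)              # change from turn t to turn t+1
    return -np.linalg.norm(diffs, axis=1)   # one score per turn transition

# Toy example: 4 turns, 3 acoustic features per turn (made-up values).
turns = np.array([
    [200.0, 65.0, 4.1],   # speaker A
    [198.0, 64.0, 4.0],   # speaker B converges toward A
    [150.0, 55.0, 3.0],   # speaker A shifts away
    [152.0, 56.0, 3.1],   # speaker B converges toward A again
])
scores = local_entrainment(turns)
print(scores)  # first and last transitions score higher than the middle one
```

A sequence model such as an LSTM can then be trained on these per-turn feature vectors (rather than the distilled scores) to predict the acoustic characteristics of upcoming turns, which is the role the abstract describes for the LSTM-based architecture.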
Recommended citation: Krishnaswamy, M. (2025). Vocal Entrainment in Multi-Party Conversations: An Exploration of Automated and Experimental Approaches (Doctoral dissertation, The University of Arizona). https://repository.arizona.edu/bitstream/handle/10150/678258/azu_etd_22395_sip1_m.pdf
