How do machines recognize who is speaking? In this talk, Gasser Elbanna dives into how speech-based neural network models represent speaker identity. While humans recognize voices intuitively, replicating this ability in artificial systems remains a major challenge, especially given the variation both across and within speakers.
This study explores self-supervised models (SSMs), including generative, predictive, and contrastive models, alongside traditional supervised models and handcrafted acoustic features. By analyzing how these models handle changes in acoustic, phonemic, prosodic, and linguistic features, the team reveals key insights into model interpretability and the parallels with human voice perception.
Whether you're interested in machine learning, neuroscience, or speech processing, this talk sheds light on the frontiers of understanding speaker identity through deep learning.
00:00 Speaker's career intro
02:50 Talk outline
04:06 Speech representation
11:22 Learning paradigms
30:03 Speaker identity perception
55:40 Takeaways