Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors

Chetupalli SR, Habets E (2022)

Publication Type: Conference contribution

Publication year: 2022

Publisher: International Speech Communication Association

Book Volume: 2022-September

Page Range: 5393-5397

Conference Proceedings Title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Event location: Incheon, KOR

DOI: 10.21437/Interspeech.2022-10849

Abstract

Speaker-independent speech separation for single-channel mixtures with an unknown number of speakers in the waveform domain is considered in this paper. To deal with the unknown number of sources, we incorporate an encoder-decoder attractor (EDA) module into a speech separation network. The neural network architecture consists of a trainable encoder-decoder pair and a masking network. The masking network in the proposed approach is inspired by the transformer-based SepFormer separation system. It contains a dual-path block and a triple-path block, each block modeling both short-time and long-time dependencies in the signal. The EDA module first summarises the dual-path block output using an LSTM encoder and then generates one attractor vector per speaker in the mixture using an LSTM decoder. The attractors are combined with the dual-path block output to generate speaker channels, which are processed jointly by the triple-path block to predict the mask. Further, a linear-sigmoid layer, with attractors as the input, predicts a binary output to indicate a stopping criterion for attractor generation. The proposed approach is evaluated on the WSJ0-mix dataset with mixtures of up to five speakers. State-of-the-art results are obtained in speech separation quality and speaker counting for all mixtures.
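
The stopping criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy weights, and the 0.5 threshold are assumptions made for illustration, and the LSTM encoder and decoder that produce the attractors are not modelled. The sketch only shows how a linear-sigmoid layer applied to each attractor could decide when to stop, which also yields a speaker count.

```python
import math


def existence_probability(attractor, weights, bias):
    """Linear layer followed by a sigmoid, applied to one attractor vector."""
    logit = sum(a * w for a, w in zip(attractor, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_attractors(attractors, weights, bias, threshold=0.5):
    """Keep attractors until the first one judged not to correspond to a speaker.

    `attractors` stands in for the vectors an attractor decoder would emit
    one by one; the threshold of 0.5 is an assumed value, not taken from the paper.
    """
    kept = []
    for attractor in attractors:
        if existence_probability(attractor, weights, bias) < threshold:
            break  # stopping criterion reached: no further speakers
        kept.append(attractor)
    return kept


# Toy example with hypothetical 2-dimensional attractors and hand-picked weights.
toy_attractors = [[2.0, 0.1], [1.5, -0.3], [-1.0, 0.2], [3.0, 0.0]]
toy_weights, toy_bias = [1.0, 0.0], 0.0

speakers = select_attractors(toy_attractors, toy_weights, toy_bias)
estimated_speaker_count = len(speakers)
```

In this toy run the third attractor falls below the threshold, so generation stops there and the fourth attractor is never considered, even though its score would be high.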

Authors with CRIS profile

Emanuël Habets Lehrstuhl für Sprach- und Akustische Signalverarbeitung

How to cite

APA:

Chetupalli, S.R., & Habets, E. (2022). Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 5393-5397). Incheon, KOR: International Speech Communication Association.

MLA:

Chetupalli, Srikanth Raj, and Emanuël Habets. "Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors." Proceedings of the 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, Incheon, KOR: International Speech Communication Association, 2022. 5393-5397.
