The subjective evaluation is conducted with webMUSHRA:
We randomly selected test samples from the et05_real subset of CHiME4 and asked participants to rate the speech enhancement quality (speech intelligibility and denoising performance) of audio generated by different models: ① the original noisy speech (CH5), and audio enhanced by ② BLSTM MVDR, ③ FaSNet, ④ MC-Conv-TasNet, Beam-TasNet (both ⑤ sig-MVDR and ⑥ mask-MVDR), and ⑦ jointly trained MC-Conv-TasNet. The order of these audio clips is randomly shuffled for each test sample, and a close-talk recording from CH0 is given as the reference.
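The per-sample shuffling described above can be sketched as follows. This is only an illustration, not the configuration used in the actual test: the condition keys, the `build_trial` and `build_session` helpers, and the sample IDs are hypothetical names, and the dictionary layout does not follow webMUSHRA's own configuration format.

```python
import random

# Hypothetical keys for the seven rated conditions (1-7 in the text above).
CONDITIONS = [
    "noisy_ch5",
    "blstm_mvdr",
    "fasnet",
    "mc_conv_tasnet",
    "beam_tasnet_sig_mvdr",
    "beam_tasnet_mask_mvdr",
    "joint_mc_conv_tasnet",
]


def build_trial(sample_id, rng):
    """Return one trial: a fixed reference plus the conditions in random order."""
    order = CONDITIONS[:]   # copy so the master list is never mutated
    rng.shuffle(order)      # independent shuffle for every test sample
    return {
        "sample": sample_id,
        "reference": "close_talk_ch0",  # reference is not shuffled
        "stimuli": order,
    }


def build_session(sample_ids, seed=0):
    """Build all trials for one listener; the seed makes the order reproducible."""
    rng = random.Random(seed)
    return [build_trial(sample_id, rng) for sample_id in sample_ids]


if __name__ == "__main__":
    # Placeholder sample IDs, not real CHiME4 utterance names.
    for trial in build_session(["utt_a", "utt_b", "utt_c"], seed=1):
        print(trial["sample"], trial["stimuli"])
```

Using a dedicated `random.Random` instance keeps the shuffle reproducible without touching the global random state.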
Model | MOS | S-MOS | N-MOS |
---|---|---|---|
① Noisy Input (CH5) | 59.01 | 89.89 | 47.07 |
② BLSTM MVDR in [24] | 77.49 | 91.62 | 66.29 |
③ FaSNet [31] | 67.57 | 71.34 | 77.90 |
-------------------------------------- | ----------------- | ----------------- | --------------------- |
④ MC-Conv-TasNet | 51.82 | 57.42 | 63.85 |
⑤ → Beam-TasNet (sig-MVDR) | 69.45 | 79.71 | 64.75 |
⑥ → Beam-TasNet (mask-MVDR, 1-D) | 77.32 | 92.21 | 70.07 |
⑦ Jointly trained MC-Conv-TasNet + ASR | 74.61 | 74.13 | 85.17 |
The subjective evaluation is conducted in terms of the following criteria:
MOS: Determination of subjective global MOS. Select the category which best describes the heard sample for the purpose of everyday speech communication. The OVERALL SPEECH SAMPLE was 100-Excellent / 80-Good / 60-Fair / 40-Poor / 20-Bad.
S-MOS: Determination of subjective speech MOS (S-MOS). Attending ONLY to the SPEECH SIGNAL, select the category which best describes the heard sample. The SPEECH SIGNAL in this sample was 100-Not Distorted / 80-Slightly Distorted / 60-Somewhat Distorted / 40-Fairly Distorted / 20-Very Distorted.
N-MOS: Determination of subjective noise MOS (N-MOS). Attending ONLY to the BACKGROUND, select the category which best describes the heard sample. The BACKGROUND in this sample was 100-Not Noticeable / 80-Slightly Noticeable / 60-Noticeable But Not Intrusive / 40-Somewhat Intrusive / 20-Very Intrusive.
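To help read the averaged scores in the table against these scales, the following sketch maps a 0-100 score to the nearest labeled anchor. Rounding to the nearest anchor (with ties going to the higher category) is an assumption made here for illustration; it is not part of the evaluation protocol, and `nearest_label` is a hypothetical helper.

```python
# Anchor values and category labels copied from the three scales above.
ANCHORS = (100, 80, 60, 40, 20)
LABELS = {
    "MOS": ("Excellent", "Good", "Fair", "Poor", "Bad"),
    "S-MOS": (
        "Not Distorted",
        "Slightly Distorted",
        "Somewhat Distorted",
        "Fairly Distorted",
        "Very Distorted",
    ),
    "N-MOS": (
        "Not Noticeable",
        "Slightly Noticeable",
        "Noticeable But Not Intrusive",
        "Somewhat Intrusive",
        "Very Intrusive",
    ),
}


def nearest_label(scale, score):
    """Return the label of the anchor closest to `score` on the given scale.

    Assumption: ties (e.g. a score of 90) resolve to the higher category,
    because anchors are scanned from 100 downwards.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    index = min(range(len(ANCHORS)), key=lambda i: abs(ANCHORS[i] - score))
    return LABELS[scale][index]


if __name__ == "__main__":
    # Scores taken from the BLSTM MVDR row of the table.
    for scale, score in (("MOS", 77.49), ("S-MOS", 91.62), ("N-MOS", 66.29)):
        print(scale, score, nearest_label(scale, score))
```

Because the table reports means over listeners and samples, the resulting label is only a rough reading aid, not a rating any participant actually selected.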