Abstract
Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods use Residual Vector Quantization (RVQ) to encode speech into multiple layers of discrete codes that all share a uniform time scale. However, this strategy overlooks the differences in information density across speech features, leading to redundant encoding of sparse information and limiting performance at low bitrates. This paper proposes a novel multi-scale neural speech codec, MsCodec, that encodes speech into multiple layers of discrete codes, each with a distinct time scale. This encourages the model to decouple speech features according to their diverse information densities, thereby improving speech compression. Furthermore, we incorporate a mutual information loss to increase the diversity of speech codes across layers. Experimental results indicate that our proposed method significantly improves codec performance at low bitrates.
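To make the multi-scale idea concrete, here is a minimal, hypothetical sketch (not the paper's actual architecture): the same feature sequence is average-pooled at several strides and each pooled sequence is quantized with its own codebook, so coarser layers spend fewer tokens on slowly varying information. All function names, shapes, and strides below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    # frames: (T, D), codebook: (K, D) -> (T,) integer codes
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def multiscale_encode(features, codebooks, strides):
    """Quantize one feature sequence at several time scales.

    A larger stride averages more frames per code, so a layer with a
    coarse time scale emits fewer tokens for the same utterance.
    """
    codes = []
    for codebook, stride in zip(codebooks, strides):
        T, D = features.shape
        pooled = features[: T - T % stride].reshape(-1, stride, D).mean(axis=1)
        codes.append(quantize(pooled, codebook))
    return codes

features = rng.standard_normal((80, 16))              # 80 frames, 16-dim
codebooks = [rng.standard_normal((64, 16)) for _ in range(3)]
codes = multiscale_encode(features, codebooks, strides=[1, 2, 4])
print([len(c) for c in codes])  # token count shrinks as the stride grows
```

With strides 1, 2, and 4, the three layers emit 80, 40, and 20 tokens respectively; in a uniform-scale RVQ each layer would emit 80.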
Figure 1: Architecture of MsCodec.
Samples of Various Codecs
GT | EnCodec-4VQ 3000bps | HiFiCodec-4VQ 3000bps | TiCodec-4VQ 3000bps | MsCodec-L 2800bps (Ours)
---|---|---|---|---
GT | EnCodec-2VQ 1500bps | HiFiCodec-2VQ 1500bps | TiCodec-2VQ 1500bps | MsCodec-M 1400bps (Ours)
---|---|---|---|---
GT | EnCodec-1VQ 750bps | HiFiCodec-1VQ 750bps | TiCodec-1VQ 750bps | MsCodec-S 700bps (Ours)
---|---|---|---|---
GT | MsCodec-M 1400bps | w/o MI Loss 1400bps | w/o Multi-Scale 1500bps
---|---|---|---