Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement

View the Project on GitHub Mattias421/cfmse

Mattias Cross, Anton Ragni

Current flow-based generative speech enhancement methods learn curved probability paths that model a mapping between clean and noisy speech. Despite their impressive performance, the implications of curved probability paths are unknown. Methods such as Schrödinger bridges focus on curved paths, where time-dependent gradients and variance do not promote straight paths. Findings in machine learning research suggest that straight paths, such as those of conditional flow matching, are easier to train and offer better generalisation. Here we show the effect of path straightness on speech enhancement quality. We report experiments with the Schrödinger bridge, where we show that certain configurations are straighter. In contrast, we propose independent conditional flow matching (ICFM) for speech enhancement, which models straight paths between noisy and clean speech. We identify empirically that a time-independent variance has a greater effect on sample quality than the gradient. Conditional flow matching improves several speech quality metrics, but requires more inference steps. We rectify this with a one-step solution that runs inference with the trained flow-based model as if it were directly predictive. Our work suggests that straighter time-independent probability paths improve generative speech enhancement over curved time-dependent paths.
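To make the idea of a straight, time-independent-variance path concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the standard independent conditional flow matching form, in which the path mean interpolates linearly between the two endpoints and the standard deviation `sigma` is constant in time, and it assumes the path runs from noisy speech at t = 0 to clean speech at t = 1. The names `icfm_sample` and `euler_integrate`, the toy signals, and the oracle velocity field are all hypothetical and serve only as an illustration.

```python
import numpy as np


def icfm_sample(noisy, clean, t, sigma, rng):
    """Sample x_t ~ N((1 - t) * noisy + t * clean, sigma^2 I) and its target velocity.

    Assumption: noisy speech sits at t = 0 and clean speech at t = 1.
    """
    mean = (1.0 - t) * noisy + t * clean
    x_t = mean + sigma * rng.standard_normal(noisy.shape)
    # The mean moves along a straight line, so the target velocity does not depend on t.
    target_velocity = clean - noisy
    return x_t, target_velocity


def euler_integrate(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with forward Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
clean = rng.standard_normal(16)                # toy stand-in for a clean signal
noisy = clean + 0.5 * rng.standard_normal(16)  # toy stand-in for its noisy version

x_t, u = icfm_sample(noisy, clean, t=0.3, sigma=0.1, rng=rng)


def oracle(x, t):
    """Hypothetical perfect velocity field; a trained network would replace this."""
    return clean - noisy


one_step = euler_integrate(oracle, noisy, num_steps=1)
many_step = euler_integrate(oracle, noisy, num_steps=30)
```

With a perfectly straight path and an exact velocity field, a single Euler step reaches the same endpoint as many small steps. A learned velocity field is only approximately constant along the path, which is one way to read why straightness and the number of inference steps are linked.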

Results on VB-DMD

We evaluate the performance of the proposed models on the VoiceBank-Demand (VB-DMD) dataset. The results are summarized in the table below.

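Table: Ablation Study and Model Performance on VB-DMD. Values indicate the mean of the metrics.

| Path  | Loss | Inference | k    | c     | PESQ | ESTOI | SI-SDR (dB) | DNSMOS | WhiSQA |
|-------|------|-----------|------|-------|------|-------|-------------|--------|--------|
| Noisy | -    | -         | -    | -     | 1.97 | 0.79  | 8.4         | 3.05   | 3.11   |
| SB-VE | DP   | ODE       | 2.6  | 0.4   | 2.92 | 0.87  | 19.3        | 3.56   | 4.46   |
| SB-VE | DP   | DDP       | 2.6  | 0.4   | 2.92 | 0.87  | 19.4        | 3.55   | 4.45   |
| SB-VE | DP   | ODE       | 0.99 | 0.375 | 2.92 | 0.88  | 19.5        | 3.56   | 4.47   |
| SB-SV | DP   | ODE       | 2.6  | 0.15  | 2.98 | 0.88  | 19.4        | 3.58   | 4.51   |
| SB-SV | DP   | DDP       | 2.6  | 0.15  | 2.98 | 0.88  | 19.9        | 3.58   | 4.50   |
| SB-SV | DP   | ODE       | 0.99 | 0.1   | 2.86 | 0.88  | 19.5        | 3.58   | 4.47   |
| SB-SV | DP   | DDP       | 0.99 | 0.1   | 2.99 | 0.88  | 20.0        | 3.57   | 4.50   |
| ICFM  | DP   | ODE       | -    | 0.1   | 2.98 | 0.88  | 20.1        | 3.59   | 4.49   |
| ICFM  | DP   | DDP       | -    | 0.1   | 3.05 | 0.88  | 20.2        | 3.58   | 4.51   |
| ICFM  | FM   | ODE       | -    | 0.1   | 2.91 | 0.88  | 20.3        | 3.60   | 4.50   |
| ICFM  | FM   | DDP       | -    | 0.1   | 3.00 | 0.88  | 20.4        | 3.59   | 4.51   |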
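For readers unfamiliar with SI-SDR, the sketch below shows the commonly used scale-invariant signal-to-distortion ratio in decibels. It is an illustration only: the page does not state which implementation produced the reported values, and the mean removal, the `eps` stabiliser, and the function name `si_sdr` are assumptions made here.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between 1-D signals of equal length.

    Assumption: both signals are made zero-mean before projection.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Toy check: the distortion is orthogonal to the reference and has 1/100 of its energy.
reference = np.array([1.0, -1.0, 1.0, -1.0])
distortion = 0.1 * np.array([1.0, 1.0, -1.0, -1.0])
score = si_sdr(reference + distortion, reference)  # approximately 20 dB
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, so the metric reflects distortion rather than output level.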

Audio Examples

For each selectable audio file, the page offers the following versions: Noisy Input, Clean (Reference), ICFM, ICFM-DDP, ICFM-FM, ICFM-FM-DDP, SBSV, SBSV-DDP, SBSV (k=1), SBSV (k=1)-DDP, SBVE, and SBVE-DDP.

Citation

TBD

This webpage is based on https://sp-uhh.github.io/gen-se
