Quality Benchmarks Between auditok / webrtcvad / pyannote-audio / silero-vad
Instruments
We have compared 4 easy-to-use off-the-shelf tools for voice activity / audio activity detection with their off-the-shelf parameters (we did not hack / rebuild / retrain the tools, we just used them as-is; a minimal usage sketch follows this list):
- A popular Python version of webrtcvad - https://github.com/wiseman/py-webrtcvad;
- Auditok from this repo;
- Silero-vad from here - https://github.com/snakers4/silero-vad;
- The speech activity detector from pyannote - https://github.com/pyannote/pyannote-audio-hub#speech-activity-detection
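For reference, here is a minimal sketch (not our actual benchmark script) of what "using the tools as-is" looks like for `py-webrtcvad`, which operates on short frames of 16-bit mono PCM; the sample rate, frame length, and aggressiveness below are just illustrative defaults:

```python
# Minimal off-the-shelf usage of py-webrtcvad on 16 kHz, 16-bit mono PCM.
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad accepts 8000 / 16000 / 32000 / 48000 Hz
FRAME_MS = 30         # webrtcvad accepts 10 / 20 / 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(3)  # aggressiveness 0..3, 3 = most aggressive

def speech_flags(pcm: bytes):
    """Yield one hard speech / non-speech decision per 30 ms frame."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        yield vad.is_speech(frame, SAMPLE_RATE)
```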
Caveats
- Full disclaimer - we are mostly interested in portable production voice detection in phone calls, not just silence detection or dataset preparation, where all of these tools will more or less work just fine;
- In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design). It has a lot of false positives when detecting speech;
- `auditok` provides Audio Activity Detection, which probably may just mean detecting silence in layman's terms;
- `silero-vad` is geared towards speech detection (as opposed to noise or music);
- A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful, but we prefer 150-250 ms chunks, see the quality comparison here), while `auditok` and `webrtcvad` use 30-50 ms chunks (we used the default values of 30 ms for `webrtcvad` and 50 ms for `auditok`); the small calculation after this list shows what these chunk sizes mean in samples;
- Maybe we missed something, but using `pyannote` had a ton of performance caveats, i.e. it required a GPU out of the box and worked very slowly for small files (but quite fast for long files). Also, as far as we dug, streaming / online application was not possible with `pyannote` with the standard provided pipeline;
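To make the chunk sizes above concrete, here is a small illustrative calculation (assuming 16 kHz, 16-bit mono audio) of how many samples and bytes each chunk length corresponds to:

```python
# Chunk lengths in samples / bytes at 16 kHz, 16-bit mono PCM (illustrative only).
SAMPLE_RATE = 16000

for chunk_ms in (30, 50, 100, 150, 250):
    samples = SAMPLE_RATE * chunk_ms // 1000
    print(f"{chunk_ms:>3} ms -> {samples:>4} samples ({samples * 2:>4} bytes)")

# 30 ms      ->  480 samples        -- webrtcvad default in our runs
# 50 ms      ->  800 samples        -- auditok default in our runs
# 100-250 ms -> 1600-4000 samples   -- the range we prefer for silero-vad
```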
Testing Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
Known limitations:
- Usually speech is very self-correlated (i.e. if a person speaks, they will continue speaking for some time), but our test is extremely hard, because it essentially gives an algorithm only one chance. We test how well each of the algorithms determines the first / last frame of speech without the luxury of being "in the middle" of speech;
- Since we wanted to provide PR curves, it was a bit tricky with `py-webrtcvad` and `pyannote` without essentially rebuilding / modifying the C++ code for `py-webrtcvad` and without modifying the pipeline code for `pyannote`. Therefore we had to interpret the binary decisions made by these tools as probabilities or just provide one dot on the curve (see the sketch after this list);
- For production usage of `silero-vad`, or to use it with different new domains, you have to set at least some of the params properly. It can be done via the provided plotting tool;
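As a rough sketch of the two evaluation modes mentioned above (a full PR curve for tools that expose per-chunk probabilities, a single operating point for tools that only emit hard decisions), something along these lines can be done with scikit-learn; the variable names are placeholders, not our actual evaluation code:

```python
# labels: ground-truth 0/1 speech flags per chunk
# probs:  per-chunk speech probabilities (e.g. from silero-vad)
# binary: hard 0/1 decisions (e.g. from py-webrtcvad or auditok)
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

def pr_curve_from_probabilities(labels, probs):
    # Full PR curve: one (precision, recall) pair per probability threshold.
    precision, recall, thresholds = precision_recall_curve(labels, probs)
    return precision, recall, thresholds

def single_point_from_binary(labels, binary):
    # Tools that only output hard decisions contribute a single dot on the curve.
    return precision_score(labels, binary), recall_score(labels, binary)
```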
Testing Results
Finished tests (the result plots from the original issue are not reproduced here):
Portability, Speed, Production Limitations
- Looks like `webrtcvad` was originally written in C++ around 2016, so theoretically it can be ported to many platforms. I have inquired in the community; the original VAD seems to have matured, and the Python version is based on a 2018 version;
- Looks like `auditok` is written in plain Python, but I guess the algorithm itself can be ported;
- `silero-vad` is based on PyTorch and ONNX, so it boasts the same portability options both of these frameworks feature (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX); a generic ONNX inference sketch follows the table below;
- Looks like `pyannote` is also built with PyTorch, but it requires a GPU out of the box and is extremely slow with short files;
| Tool | Speed | Streaming | GPU | Portability |
|---|---|---|---|---|
| py-webrtcvad | extremely fast | yes | not required | you can build and port |
| auditok | very fast | yes | not required | python only, you can try porting |
| silero-vad | very fast | yes | not required | PyTorch and ONNX |
| pyannote | fast for long files, slow for short files | no | required | PyTorch |
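As a generic illustration of the ONNX portability route mentioned above (the file name below is a placeholder; the actual exported model and its expected inputs are documented in the silero-vad repo), loading an exported model with onnxruntime and inspecting its inputs looks roughly like this:

```python
# Load an exported VAD model with onnxruntime and inspect its inputs before
# wiring up inference; the same model file can then be served from any
# platform with an onnxruntime binding (Python, C++, Java, mobile).
import onnxruntime as ort

session = ort.InferenceSession("silero_vad.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# Inference is then a session.run(None, {...}) call with the inputs listed above.
```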
Also we ran a 30-minute audio file through `pyannote` and `silero-vad`:
- `silero-vad` - 20 seconds on CPU;
- `pyannote` - 12 seconds on a 3090;
This is by no means extensive or complete research on the topic; please point out anything that is lacking.
Comments

FYI, I just made public a preprint in which I use my own (probably unfair) benchmark – comparing `silero-vad` with default hyper-parameters and the new `pyannote/segmentation` pretrained model from `pyannote.audio`.

Hi,
There are not so many public voice detectors. We simply tried comparing the few that have any semblance of examples / docs.
In any case I am not quite sure what the point of these "particular" solutions is anyway (in addition to the fact that they depend on datasets behind the LDC paywall). Voice detection is not a difficult task compared to STT, so there is little reason not to go general.
We just used this pipeline
We had some issues running it on CPU. @adamnsamdle could you please elaborate?
Sorry for missing those. I assumed that all of the pre-trained models were presented here. We will definitely try the "bare" models.
This model and these models are similar, but trained on different datasets, right?
The script is not a problem. I will ask the people providing the data whether they are OK with publishing a validation dataset based on their data.