Quality Benchmarks Between auditok / webrtcvad / pyannote-audio / silero-vad
Instruments
We have compared 4 easy-to-use off-the-shelf tools for voice activity / audio activity detection with their off-the-shelf parameters (we did not hack / rebuild / retrain the tools, we just used them as-is; a minimal usage sketch follows this list):
- A popular Python version of webrtcvad - https://github.com/wiseman/py-webrtcvad;
- Auditok from this repo;
- Silero-vad from here - https://github.com/snakers4/silero-vad;
- The speech activity detector from pyannote - https://github.com/pyannote/pyannote-audio-hub#speech-activity-detection
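For reference, here is a minimal sketch (not our actual benchmark script) of what "using the tools as-is" looks like for `py-webrtcvad`, which operates on short frames of 16-bit mono PCM; the sample rate, frame length, and aggressiveness below are just illustrative defaults:

```python
# Minimal off-the-shelf usage of py-webrtcvad on 16 kHz, 16-bit mono PCM.
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad accepts 8000 / 16000 / 32000 / 48000 Hz
FRAME_MS = 30         # webrtcvad accepts 10 / 20 / 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(3)  # aggressiveness 0..3, 3 = most aggressive

def speech_flags(pcm: bytes):
    """Yield one hard speech / non-speech decision per 30 ms frame."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        yield vad.is_speech(frame, SAMPLE_RATE)
```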
Caveats
- Full disclaimer - we are mostly interested in portable production voice detection in phone calls, not just silence detection or dataset preparation, where all of these tools will more or less work just fine;
- In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design). It has a lot of false positives when detecting speech;
- `auditok` provides Audio Activity Detection, which probably may just mean detecting silence in layman's terms;
- `silero-vad` is geared towards speech detection (as opposed to noise or music);
- A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful, but we prefer 150-250 ms chunks, see the quality comparison here), while `auditok` and `webrtcvad` use 30-50 ms chunks (we used the default values of 30 ms for `webrtcvad` and 50 ms for `auditok`); the small calculation after this list shows what these chunk sizes mean in samples;
- Maybe we missed something, but using `pyannote` had a ton of performance caveats, i.e. it required a GPU out of the box and worked very slowly for small files (but quite fast for long files). Also, as far as we dug, streaming / online application was not possible with `pyannote` with the standard provided pipeline;
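To make the chunk sizes above concrete, here is a small illustrative calculation (assuming 16 kHz, 16-bit mono audio) of how many samples and bytes each chunk length corresponds to:

```python
# Chunk lengths in samples / bytes at 16 kHz, 16-bit mono PCM (illustrative only).
SAMPLE_RATE = 16000

for chunk_ms in (30, 50, 100, 150, 250):
    samples = SAMPLE_RATE * chunk_ms // 1000
    print(f"{chunk_ms:>3} ms -> {samples:>4} samples ({samples * 2:>4} bytes)")

# 30 ms      ->  480 samples        -- webrtcvad default in our runs
# 50 ms      ->  800 samples        -- auditok default in our runs
# 100-250 ms -> 1600-4000 samples   -- the range we prefer for silero-vad
```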
Testing Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
Known limitations:
- Usually speech is very self-correlated (i.e. if a person speaks, they will continue speaking for some time), but our test is extremely hard, because it essentially gives an algorithm only one chance. We test how well each of the algorithms determines the first / last frame of speech without the luxury of being "in the middle" of speech;
- Since we wanted to provide PR curves, it was a bit tricky with `py-webrtcvad` and `pyannote` without essentially rebuilding / modifying the C++ code for `py-webrtcvad` and without modifying the pipeline code for `pyannote`. Therefore we had to interpret the binary decisions made by these tools as probabilities or just provide one dot on the curve (see the sketch after this list);
- For production usage of `silero-vad`, or to use it with different new domains, you have to set at least some of the params properly. It can be done via the provided plotting tool;
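As a rough sketch of the two evaluation modes mentioned above (a full PR curve for tools that expose per-chunk probabilities, a single operating point for tools that only emit hard decisions), something along these lines can be done with scikit-learn; the variable names are placeholders, not our actual evaluation code:

```python
# labels: ground-truth 0/1 speech flags per chunk
# probs:  per-chunk speech probabilities (e.g. from silero-vad)
# binary: hard 0/1 decisions (e.g. from py-webrtcvad or auditok)
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

def pr_curve_from_probabilities(labels, probs):
    # Full PR curve: one (precision, recall) pair per probability threshold.
    precision, recall, thresholds = precision_recall_curve(labels, probs)
    return precision, recall, thresholds

def single_point_from_binary(labels, binary):
    # Tools that only output hard decisions contribute a single dot on the curve.
    return precision_score(labels, binary), recall_score(labels, binary)
```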
Testing Results
Finished tests (the result plots from the original issue are not reproduced here):
Portability, Speed, Production Limitations
- Looks like `webrtcvad` was originally written in C++ around 2016, so theoretically it can be ported to many platforms. I have inquired in the community; the original VAD seems to have matured, and the Python version is based on a 2018 version;
- Looks like `auditok` is written in plain Python, but I guess the algorithm itself can be ported;
- `silero-vad` is based on PyTorch and ONNX, so it boasts the same portability options both of these frameworks feature (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX); a generic ONNX inference sketch follows the table below;
- Looks like `pyannote` is also built with PyTorch, but it requires a GPU out of the box and is extremely slow with short files;
| Tool | Speed | Streaming | GPU | Portability |
|---|---|---|---|---|
| py-webrtcvad | extremely fast | yes | not required | you can build and port |
| auditok | very fast | yes | not required | python only, you can try porting |
| silero-vad | very fast | yes | not required | PyTorch and ONNX |
| pyannote | fast for long files, slow for short files | no | required | PyTorch |
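As a generic illustration of the ONNX portability route mentioned above (the file name below is a placeholder; the actual exported model and its expected inputs are documented in the silero-vad repo), loading an exported model with onnxruntime and inspecting its inputs looks roughly like this:

```python
# Load an exported VAD model with onnxruntime and inspect its inputs before
# wiring up inference; the same model file can then be served from any
# platform with an onnxruntime binding (Python, C++, Java, mobile).
import onnxruntime as ort

session = ort.InferenceSession("silero_vad.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# Inference is then a session.run(None, {...}) call with the inputs listed above.
```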
Also we ran a 30-minute audio file through `pyannote` and `silero-vad`:
- `silero-vad` - 20 seconds on CPU;
- `pyannote` - 12 seconds on a 3090;
This is by no means extensive or complete research on the topic; please point out anything that is lacking.
Comments

FYI, I just made public a preprint in which I use my own (probably unfair) benchmark – comparing `silero-vad` with default hyper-parameters and the new `pyannote/segmentation` pretrained model from `pyannote.audio`.

Hi,
There are not so many public voice detectors. We simply tried comparing the few that have any semblance of examples / docs.
In any case I am not quite sure what the point of these "particular" solutions is anyway (in addition to the fact that they depend on datasets behind the LDC paywall). Voice detection is not a difficult task compared to STT, so there is little reason not to go general.
We just used this pipeline
We had some issues running it on CPU. @adamnsamdle could you please elaborate?
Sorry for missing those. I assumed that all of the pre-trained models were presented here. We will definitely try the "bare" models.
This model and these models are similar, but trained on different datasets, right?
The script is not a problem. I will ask the people providing the data whether they are OK with publishing a validation dataset based on their data.