Python WebRTC voice activity detection is wrong

Question

I need to do voice activity detection as a step in classifying audio files.

Basically, I need to know with certainty whether a given audio file contains spoken language.

I am using py-webrtcvad, which I found on GitHub and which is sparsely documented:

https://github.com/wiseman/py-webrtcvad

The thing is, when I try it on my own audio files, it works fine on the ones that contain speech, but it keeps yielding false positives when I feed it other types of audio (such as music or birdsong), even with aggressiveness set to 3.

The audio files are sampled at 8000 Hz.

The only thing I changed in the source code was the way arguments are passed to the main function (I no longer read them from sys.argv).

import webrtcvad

# read_wave, write_wave, frame_generator and vad_collector are the helper
# functions from the example script that ships with py-webrtcvad.

def main(file, agresividad):
    audio, sample_rate = read_wave(file)
    vad = webrtcvad.Vad(int(agresividad))
    # Split the audio into 30 ms frames.
    frames = list(frame_generator(30, audio, sample_rate))
    # Collect voiced segments, using a 300 ms padding window.
    segments = vad_collector(sample_rate, 30, 300, vad, frames)
    for i, segment in enumerate(segments):
        path = 'chunk-%002d.wav' % (i,)
        print(' Writing %s' % (path,))
        write_wave(path, segment, sample_rate)

if __name__ == '__main__':
    file = 'myfilename.wav'
    agresividad = 3  # aggressiveness
    main(file, agresividad)

Recommended answer

I'm seeing the same thing. I'm afraid that is simply the limit of what it can do. Speech detection is a difficult task, and webrtcvad is designed to be light on resources, so there is only so much it can achieve. If you need more accuracy, you will need different packages or methods, which will necessarily take more computing power.

On aggressiveness, you're right that even at 3 there are still a lot of false positives. I'm also seeing false negatives, however, so one trick I'm using is to run three instances of the detector, one for each aggressiveness setting. Then, instead of classifying a frame as 0 or 1, I give it the value of the highest aggressiveness setting that still said it was speech. In other words, each frame now has a score from 0 to 3, where 0 means even the least strict detector said it wasn't speech and 3 means even the strictest one said it was. That gives me a little more resolution, and even with the false positives it is good enough for me.
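A minimal sketch of that scoring trick, in case it helps. The function names speech_score and make_detectors are made up for this illustration, and the choice of modes 1, 2 and 3 for the three instances is an assumption (the answer does not say which settings were used). The only webrtcvad calls relied on are webrtcvad.Vad(mode) and Vad.is_speech(buf, sample_rate).

```python
def make_detectors(modes=(1, 2, 3)):
    """Build one (mode, detector) pair per aggressiveness setting.

    Assumption: modes 1..3 are used, so the resulting score lies in 0..3.
    """
    import webrtcvad  # third-party package: pip install webrtcvad
    return [(mode, webrtcvad.Vad(mode)) for mode in modes]


def speech_score(frame_bytes, sample_rate, detectors):
    """Return the highest mode whose detector still calls the frame speech.

    Returns 0 when no detector accepts the frame. `detectors` is a list of
    (mode, detector) pairs; each detector only needs an
    is_speech(frame_bytes, sample_rate) method.
    """
    score = 0
    for mode, vad in detectors:
        if vad.is_speech(frame_bytes, sample_rate):
            score = max(score, mode)
    return score
```

With the frames list built in main from the question, something like scores = [speech_score(f.bytes, sample_rate, detectors) for f in frames] should give one score per frame, assuming the frame objects expose their raw PCM data as a bytes attribute, as the ones in the py-webrtcvad example script do. A threshold on the score (or on its average over a file) can then be tuned to taste.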
