Convert PCM wave data to numpy arrays and vice versa

Question

Situation

I am using VAD (Voice Activity Detection) from WebRTC via WebRTC-VAD, a Python adapter. The example implementation from the GitHub repo uses Python's wave module to read PCM data from files. Note that, according to the comments, the module only works with mono audio and a sampling rate of either 8000, 16000 or 32000 Hz.
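
For illustration, here is a minimal sketch of classifying a single frame with py-webrtcvad (the file name is a placeholder; it assumes a 16 kHz mono 16-bit PCM WAV file):

import wave
import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness from 0 (least) to 3 (most aggressive)

with wave.open('speech.wav', 'rb') as wf:     # hypothetical 16 kHz mono PCM WAV
    rate = wf.getframerate()                  # must be 8000, 16000 or 32000 Hz
    frame = wf.readframes(int(rate * 0.03))   # one 30 ms frame (10/20/30 ms allowed)
    print(vad.is_speech(frame, rate))         # True if the frame contains speech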

What I want to do

Read audio data from arbitrary audio files (MP3 and WAV files) with different sampling rates, convert them into the PCM representation that WebRTC-VAD expects, apply WebRTC-VAD to detect voice activity, and finally process the result by producing NumPy arrays from the PCM data again, because those are easiest to work with when using Librosa.

My problem

The WebRTC-VAD module only works correctly when using the wave module, which returns PCM data as bytes objects. It does not work when fed NumPy arrays obtained e.g. via librosa.load(...). I have not found a way to convert between the two representations.
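
To make the mismatch concrete, a small sketch of the two representations side by side (the file name is a placeholder):

import wave
import librosa

# wave: raw little-endian 16-bit PCM as a bytes object
with wave.open('speech.wav', 'rb') as wf:
    pcm = wf.readframes(wf.getnframes())
print(type(pcm))  # <class 'bytes'>

# librosa: float32 samples scaled to [-1.0, 1.0]
samples, sr = librosa.load('speech.wav', sr=16000, mono=True)
print(samples.dtype)  # float32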

What I have done so far

I have written the following functions to read audio data from audio files and automatically convert them:

Generic function to read/convert any audio data with Librosa (--> returns NumPy array):

import librosa

def read_audio(file_path, sample_rate=None, mono=False):
    # returns (float array in [-1.0, 1.0], sample rate); sr=None keeps the native rate
    return librosa.load(file_path, sr=sample_rate, mono=mono)

Functions to read arbitrary data as PCM data (--> returns bytes):

import wave
from os import remove

import librosa
import soundfile as sf

def read_audio_vad(file_path):
    # resample to 16 kHz mono with librosa, write it out as 16-bit PCM WAV,
    # then read the raw PCM bytes back in with the wave module
    audio, rate = librosa.load(file_path, sr=16000, mono=True)
    tmp_file = 'tmp.wav'
    sf.write(tmp_file, audio, rate, subtype='PCM_16')
    audio, rate = read_pcm16_wave(tmp_file)
    remove(tmp_file)
    return audio, rate

def read_pcm16_wave(file_path):
    with wave.open(file_path, 'rb') as wf:
        sample_rate = wf.getframerate()
        pcm_data = wf.readframes(wf.getnframes())
        return pcm_data, sample_rate
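
For example (with a hypothetical input file), the returned bytes object holds two bytes per 16-bit sample:

pcm_data, rate = read_audio_vad('speech.mp3')  # hypothetical input file
print(rate)           # 16000
print(len(pcm_data))  # number of samples * 2 (16-bit mono PCM)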

As you can see, I am making a detour by reading/converting the audio data with Librosa first. This is needed so I can read MP3 files or WAV files with arbitrary encodings and automatically resample them to 16 kHz mono with Librosa. I then write to a temporary file. Before deleting the file, I read the contents out again, but this time using the wave module. This gives me the PCM data.

I now have the following code to extract the voice activity and produce NumPy arrays:

def webrtc_voice(audio, rate):
    voiced_frames = webrtc_split(audio, rate)
    tmp_file = 'tmp.wav'
    for frames in voiced_frames:
        # concatenate the raw PCM bytes of one voiced segment
        voice_audio = b''.join([f.bytes for f in frames])
        write_pcm16_wave(tmp_file, voice_audio, rate)
        voice_audio, rate = read_audio(tmp_file)
        remove(tmp_file)

        # timestamps and durations are in seconds
        start_time = frames[0].timestamp
        end_time = frames[-1].timestamp + frames[-1].duration
        start_frame = int(round(start_time * rate))
        end_frame = int(round(end_time * rate))
        yield voice_audio, rate, start_frame, end_frame

def write_pcm16_wave(path, audio, sample_rate):
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(audio)
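
A sketch of how these functions might be driven end to end (the input path is again a placeholder):

pcm_data, rate = read_audio_vad('speech.mp3')  # hypothetical input file
for voice_audio, sr, start_frame, end_frame in webrtc_voice(pcm_data, rate):
    # voice_audio is a NumPy array again; start_frame/end_frame index into
    # the original 16 kHz signal
    print(sr, start_frame, end_frame, len(voice_audio))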

As you can see, I am taking the detour over a temporary file again to write the PCM data first and then read the temporary file out again with Librosa to get a NumPy array. The webrtc_split function is taken from the example implementation with only a few minor changes. For completeness' sake I am posting it here:

import collections

from webrtcvad import Vad

def webrtc_split(audio, rate, aggressiveness=3, frame_duration_ms=30, padding_duration_ms=300):
    vad = Vad(aggressiveness)

    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False

    voiced_frames = []
    for frame in generate_frames(audio, rate):
        is_speech = vad.is_speech(frame.bytes, rate)

        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            # start collecting once more than 90% of the buffered frames are voiced
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                for f, s in ring_buffer:
                    voiced_frames.append(f)
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            # stop once more than 90% of the buffered frames are unvoiced
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield voiced_frames
                ring_buffer.clear()
                voiced_frames = []
    if voiced_frames:
        yield voiced_frames


class Frame(object):
    """
    object holding the audio signal of a fixed time interval (30ms) inside a long audio signal
    """

    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration


def generate_frames(audio, sample_rate, frame_duration_ms=30):
    # frame_length is in bytes: 2 bytes per sample for 16-bit PCM
    frame_length = int(sample_rate * frame_duration_ms / 1000) * 2
    offset = 0
    timestamp = 0.0
    # duration in seconds: samples per frame (frame_length / 2) divided by the rate
    duration = (float(frame_length) / sample_rate) / 2.0
    while offset + frame_length < len(audio):
        yield Frame(audio[offset:offset + frame_length], timestamp, duration)
        timestamp += duration
        offset += frame_length
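
As a quick sanity check of the frame math: at 16 kHz, a 30 ms frame covers 16000 * 0.03 = 480 samples, i.e. 960 bytes of 16-bit PCM. A sketch (assuming pcm_data comes from read_audio_vad above):

frames = list(generate_frames(pcm_data, 16000))
print(len(frames[0].bytes))  # 960 bytes = 480 samples * 2 bytes
print(frames[0].duration)    # 0.03 seconds
print(frames[1].timestamp)   # 0.03 (frames are adjacent, without overlap)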

My question

My implementation, writing/reading temporary files with the wave module and reading/writing these files with Librosa to get NumPy arrays, seems overly complicated to me. However, despite spending a whole day on the matter, I did not find a way to convert directly between the two encodings. I admit I don't fully understand all the details of PCM and WAVE files, the impact of using 16/24/32 bit for PCM data, or the endianness. I hope my explanations above are detailed enough and not too much. Is there an easier way to convert between the two representations in memory?

Answer

It seems that WebRTC-VAD and its Python wrapper, py-webrtcvad, expect the audio data to be 16-bit little-endian PCM, which is the most common storage format in WAV files.

librosa and its underlying I/O library pysoundfile, however, always return floating-point arrays in the range [-1.0, 1.0]. To convert this to bytes containing 16-bit PCM you can use the following float_to_pcm16 function.

def float_to_pcm16(audio):
    import numpy

    # scale [-1.0, 1.0] floats to signed 16-bit integers
    ints = (audio * 32767).astype(numpy.int16)
    # force little-endian byte order, then take the raw bytes
    little_endian = ints.astype('<i2')
    buf = little_endian.tobytes()
    return buf


def read_pcm16(path):
    import soundfile

    audio, sample_rate = soundfile.read(path)
    assert sample_rate in (8000, 16000, 32000, 48000)
    pcm_data = float_to_pcm16(audio)
    return pcm_data, sample_rate
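
For the opposite direction (the "vice versa" in the title), the raw bytes can be reinterpreted as little-endian int16 with numpy.frombuffer and rescaled; a minimal sketch:

def pcm16_to_float(pcm_data):
    import numpy

    # interpret the raw bytes as little-endian signed 16-bit integers
    ints = numpy.frombuffer(pcm_data, dtype='<i2')
    # rescale to the [-1.0, 1.0] float range used by librosa/pysoundfile
    return ints.astype(numpy.float32) / 32768.0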
