Producing spectrogram from microphone


Question

Below I have code that will take input from a microphone, and if the average of the audio block passes a certain threshold it will produce a spectrogram of the audio block (which is 30 ms long). Here is what a generated spectrogram looks like in the middle of normal conversation:

From what I have seen, this doesn't look anything like what I'd expect a spectrogram to look like given the audio and its environment. I was expecting something more like the following (transposed to preserve space):

The microphone I'm recording with is the default one on my MacBook; any suggestions on what's going wrong?

record.py:

import pyaudio
import struct
import math
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt


THRESHOLD = 40 # dB
RATE = 44100
INPUT_BLOCK_TIME = 0.03 # 30 ms
INPUT_FRAMES_PER_BLOCK = int(RATE * INPUT_BLOCK_TIME)

def get_rms(block):
    return np.sqrt(np.mean(np.square(block)))

class AudioHandler(object):
    def __init__(self):
        self.pa = pyaudio.PyAudio()
        self.stream = self.open_mic_stream()
        self.threshold = THRESHOLD
        self.plot_counter = 0

    def stop(self):
        self.stream.close()

    def find_input_device(self):
        device_index = None
        for i in range( self.pa.get_device_count() ):
            devinfo = self.pa.get_device_info_by_index(i)
            print('Device {}: {}'.format(i, devinfo['name']))

            for keyword in ['mic','input']:
                if keyword in devinfo['name'].lower():
                    print('Found an input: device {} - {}'.format(i, devinfo['name']))
                    device_index = i
                    return device_index

        if device_index is None:
            print('No preferred input found; using default input device.')

        return device_index

    def open_mic_stream( self ):
        device_index = self.find_input_device()

        stream = self.pa.open(  format = pyaudio.paInt16,
                                channels = 1,
                                rate = RATE,
                                input = True,
                                input_device_index = device_index,
                                frames_per_buffer = INPUT_FRAMES_PER_BLOCK)

        return stream

    def processBlock(self, snd_block):
        f, t, Sxx = signal.spectrogram(snd_block, RATE)
        plt.pcolormesh(t, f, Sxx)
        plt.ylabel('Frequency [Hz]')
        plt.xlabel('Time [sec]')
        plt.savefig('data/spec{}.png'.format(self.plot_counter), bbox_inches='tight')
        self.plot_counter += 1

    def listen(self):
        try:
            raw_block = self.stream.read(INPUT_FRAMES_PER_BLOCK, exception_on_overflow = False)
            count = len(raw_block) // 2
            format = '%dh' % (count)
            snd_block = np.array(struct.unpack(format, raw_block))
        except Exception as e:
            print('Error recording: {}'.format(e))
            return

        amplitude = get_rms(snd_block)
        if amplitude > self.threshold:
            self.processBlock(snd_block)
        else:
            pass

if __name__ == '__main__':
    audio = AudioHandler()
    for i in range(100):
        audio.listen()
    audio.stop()


Edits based on comments:

If we constrain the rate to 16000 Hz and use a logarithmic scale for the colormap, this is an output for tapping near the microphone:

Which still looks slightly odd to me, but also seems like a step in the right direction.

Using Sox and comparing with a spectrogram generated from my program:

Answer

First, observe that your code plots up to 100 spectrograms (if processBlock is called multiple times) on top of each other and you only see the last one. You may want to fix that. Furthermore, I assume you know why you want to work with 30ms audio recordings. Personally, I can't think of a practical application where 30ms recorded by a laptop microphone could give interesting insights. It hinges on what you are recording and how you trigger the recording, but this issue is tangential to the actual question.
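As a side note, a minimal sketch of one way to fix that accumulation (my own addition, not part of the question's code; the `path` parameter and the `save_spectrogram` name are hypothetical) is to create a fresh figure per block and close it after saving:

```python
import matplotlib.pyplot as plt

# Sketch: save each spectrogram into its own figure so repeated calls
# do not draw on top of earlier plots. The body loosely mirrors the
# question's processBlock; 'path' is a hypothetical parameter.
def save_spectrogram(t, f, Sxx, counter, path='data/spec{}.png'):
    fig, ax = plt.subplots()
    ax.pcolormesh(t, f, Sxx)
    ax.set_ylabel('Frequency [Hz]')
    ax.set_xlabel('Time [sec]')
    fig.savefig(path.format(counter), bbox_inches='tight')
    plt.close(fig)  # release the figure so plots do not accumulate
```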

Otherwise the code works perfectly. With just a few small changes in the processBlock function, applying some background knowledge, you can get informative and aesthetic spectrograms.

So let's talk about actual spectrograms. I'll take the SoX output as reference. The colorbar annotation says that it is dBFS [1], which is a logarithmic measure (dB is short for decibel). So, let's first convert the spectrogram to dB:

f, t, Sxx = signal.spectrogram(snd_block, RATE)
dBS = 10 * np.log10(Sxx)  # convert to dB
plt.pcolormesh(t, f, dBS)
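One caveat with this conversion (my note, not from the original answer): `np.log10` produces `-inf` and runtime warnings for bins where `Sxx` is exactly zero. A common workaround is to clamp the power to a small floor first; the floor value here (1e-10, i.e. a -100 dB floor) is an arbitrary choice:

```python
import numpy as np

# Sketch: clamp zero-power bins before converting to dB so log10 never
# sees 0. The 1e-10 floor corresponds to -100 dB and is arbitrary.
def power_to_db(Sxx, floor=1e-10):
    return 10 * np.log10(np.maximum(Sxx, floor))
```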

This improved the color scale. Now we see noise in the higher frequency bands that was hidden before. Next, let's tackle time resolution. The spectrogram divides the signal into segments (default length is 256) and computes the spectrum for each. This means we have excellent frequency resolution but very poor time resolution because only a few such segments fit into the signal window (which is about 1300 samples long). There is always a trade-off between time and frequency resolution. This is related to the uncertainty principle. So let's trade some frequency resolution for time resolution by splitting the signal into shorter segments:

f, t, Sxx = signal.spectrogram(snd_block, RATE, nperseg=64)

Great! Now we got a relatively balanced resolution on both axes - but wait! Why is the result so pixelated?! Actually, this is all the information there is in the short 30ms time window. There are only so many ways 1300 samples can be distributed in two dimensions. However, we can cheat a bit and use higher FFT resolution and overlapping segments. This makes the result smoother although it does not provide additional information:

f, t, Sxx = signal.spectrogram(snd_block, RATE, nperseg=64, nfft=256, noverlap=60)

Behold pretty spectral interference patterns. (These patterns depend on the window function used, but let's not get caught in details, here. See the window argument of the spectrogram function to play with these.) The result looks nice, but actually does not contain any more information than the previous image.

To make the result more SoX-like, observe that the SoX spectrogram is rather smeared on the time axis. You get this effect by using the original low time resolution (long segments) but letting them overlap for smoothness:

f, t, Sxx = signal.spectrogram(snd_block, RATE, noverlap=250)

I personally prefer the 3rd solution, but you will need to find your own preferred time/frequency trade-off.

Finally, let's use a colormap that is more like SoX's:

plt.pcolormesh(t, f, dBS, cmap='inferno')

A brief remark on the following line:

THRESHOLD = 40 # dB

The threshold is compared against the RMS of the input signal, which is not measured in dB but raw amplitude units.
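If you actually want a threshold in dB, here is a hedged sketch of how the RMS could be expressed in dBFS, assuming 16-bit samples (full scale of 32768, matching the question's paInt16 stream); `rms_to_dbfs` is a name of my own:

```python
import numpy as np

# Sketch: express the RMS of an int16 block in dBFS so a threshold
# like "-40 dBFS" is meaningful. The full-scale reference (32768)
# assumes 16-bit samples; eps avoids log10(0) on silent blocks.
def rms_to_dbfs(block, full_scale=32768.0, eps=1e-12):
    rms = np.sqrt(np.mean(np.square(block.astype(np.float64))))
    return 20 * np.log10(rms / full_scale + eps)
```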

[1] Apparently FS is short for full scale. dBFS means that the dB measure is relative to the maximum range: 0 dB is the loudest signal possible in the current representation, so actual values must be <= 0 dB.
