How to Correlate Two Audio Events (Detect if they are Similar) in Python


Problem description

For my project I have to detect whether two audio files are similar and when the first audio file is contained in the second one. My problem is that I tried to use numpy.correlate together with librosa, and I don't know whether I'm doing it right. How can I detect whether one audio file is contained in another audio file?

import librosa
import numpy

long_audio_series, long_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\long_file.mp3")
short_audio_series, short_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\short_file.mka")

for long_stream_id, long_stream in enumerate(long_audio_series):
    for short_stream_id, short_stream in enumerate(short_audio_series):
        print(numpy.correlate(long_stream, short_stream))

Solution


Simply comparing the audio signals long_audio_series and short_audio_series probably won't work. What I'd recommend doing is audio fingerprinting, or, to be more precise, essentially a poor man's version of what Shazam does. There is of course the patent and the paper, but you might want to start with this very readable description. The central image from that article is the constellation map (CM), a 2D map of the spectrogram peaks over time.

If you don't want to scale to very many songs, you can skip the whole hashing part and concentrate on peak finding.

So what you need to do is:

  1. Create a power spectrogram (easy with librosa.core.stft).
  2. Find local peaks in all your files (can be done with scipy.ndimage.filters.maximum_filter) to create CMs, i.e., 2D images containing only the peaks. The resulting CM is typically binary, i.e., containing 0 for no peak and 1 for a peak (see the sketch right after this list).
  3. Slide your query CM (based on short_audio_series) over each of your database CMs (based on long_audio_series). For each time step, count how many "stars" (i.e., 1s) align and store the count along with the slide offset (essentially the position of the short audio in the long audio).
  4. Pick the max count and return the corresponding short audio and its position in the long audio. You will have to convert frame numbers back to seconds.
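
Steps 1 and 2 could look roughly like the sketch below. The neighborhood size and the -40 dB threshold are arbitrary assumptions you would have to tune, not part of the original recipe:

import numpy as np
import librosa
from scipy.ndimage import maximum_filter

def constellation_map(audio_series, n_fft=2048, hop_length=512,
                      neighborhood=(30, 30), min_db=-40.0):
    # step 1: power spectrogram, converted to dB so a single threshold works
    spectrum = librosa.core.stft(audio_series, n_fft=n_fft, hop_length=hop_length)
    spectrum_db = librosa.amplitude_to_db(np.abs(spectrum), ref=np.max)
    # step 2: a bin is a peak if it equals the maximum of its local
    # neighborhood and is loud enough
    local_max = maximum_filter(spectrum_db, size=neighborhood) == spectrum_db
    peaks = local_max & (spectrum_db > min_db)
    # transpose so dim 0 is the time frame and dim 1 the frequency bin,
    # matching the "slide" example below
    return peaks.T.astype(np.uint8)

cm_short = constellation_map(short_audio_series)
cm_long = constellation_map(long_audio_series)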

Example for the "slide" (untested sample code):

import numpy as np

scores = {}
cm_short = ...  # 2d constellation map for the short audio
cm_long = ...   # 2d constellation map for the long audio
# we assume that dim 0 is the time frame
# and dim 1 is the frequency bin
# both CMs contains only 0 or 1
frames_short = cm_short.shape[0]
frames_long = cm_long.shape[0]
for offset in range(frames_long - frames_short + 1):  # +1 so the last possible alignment is checked too
    cm_long_excerpt = cm_long[offset:offset+frames_short]
    score = np.sum(np.multiply(cm_long_excerpt, cm_short))
    scores[offset] = score
# TODO: find the highest score in "scores" and
# convert its offset back to seconds
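
The TODO at the end could be finished along these lines. This is only a sketch: it assumes the CMs were built with a hop length of 512 samples (adjust the constant to whatever you actually used) and reuses long_audio_rate, the sample rate returned by librosa.load in the question:

best_offset = max(scores, key=scores.get)  # frame offset with the most aligned peaks
hop_length = 512                           # assumed STFT hop length used when building the CMs
position_in_seconds = best_offset * hop_length / long_audio_rate
print(f"best match at {position_in_seconds:.2f} s with score {scores[best_offset]}")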

Now, if your database is large, this will lead to way too many comparisons and you will also have to implement the hashing scheme, which is also described in the article I linked to above.

Note that the described procedure only matches identical recordings, but allows for noise and slight distortion. If that is not what you want, please define similarity a little better, because that could be all kinds of things (drum patterns, chord sequence, instrumentation, ...). A classic, DSP-based way to find similarities for these features is the following: Extract the appropriate feature for short frames (e.g. 256 samples) and then compute the similarity. E.g., if harmonic content is of interest to you, you could extract chroma vectors and then calculate a distance between chroma vectors, e.g., cosine distance. When you compute the similarity of each frame in your database signal with every frame in your query signal you end up with something similar to a self similarity matrix (SSM) or recurrence matrix (RM). Diagonal lines in the SSM/RM usually indicate similar sections.
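
As a small illustration of that chroma-based approach, here is a sketch using librosa.feature.chroma_stft and SciPy's cosine distance; the hop length is again an assumed parameter:

import librosa
from scipy.spatial.distance import cdist

# one 12-dimensional chroma vector per frame, shape (12, n_frames)
chroma_long = librosa.feature.chroma_stft(y=long_audio_series, sr=long_audio_rate, hop_length=512)
chroma_short = librosa.feature.chroma_stft(y=short_audio_series, sr=short_audio_rate, hop_length=512)

# pairwise cosine distances between every query frame and every database frame
distance_matrix = cdist(chroma_short.T, chroma_long.T, metric="cosine")

Low-distance diagonal stripes in distance_matrix then correspond to the similar sections mentioned above.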
