Algorithms for determining the key of an audio sample


Question

I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about approximating the key of a musical audio sample?

Antares Autotune and Melodyne are two pieces of software that do this sort of thing.

Can anyone give a bit of a layman's explanation of how this would work? That is, how to mathematically deduce the key of a song by analysing the frequency spectrum for chord progressions, etc.

This topic interests me a lot!

Edit - brilliant sources and a wealth of information can be found from everyone who contributed to this question.

Especially from: the_mandrill and Daniel Brückner.

Recommended answer

It's worth being aware that this is a very tricky problem, and if you don't have a background in signal processing (or an interest in learning about it) then you have a very frustrating time ahead of you. If you're expecting to throw a couple of FFTs at the problem then you won't get very far. I hope you do have the interest, as it is a really fascinating area.

Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic instruments (e.g. voice) using a method such as autocorrelation or harmonic sum spectrum (e.g. see Paul R's link). However, you'll often find that this gives the wrong results: you'll often get half or double the pitch that you were expecting. This is called pitch period doubling, or octave errors, and it occurs essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If you have an instrument played by a human there will always be some variation.
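As a rough illustration of the autocorrelation idea, here is a minimal sketch in Python with NumPy. It is not taken from any of the tools mentioned in the question; the function name `estimate_pitch_autocorr` and the `fmin`/`fmax` search limits are assumptions made for this example. It simply picks the lag with the strongest self-similarity, so it is exactly the kind of estimator that can produce the octave errors described above.

```python
import numpy as np

def estimate_pitch_autocorr(signal, sample_rate, fmin=50.0, fmax=1000.0):
    """Rough fundamental-frequency estimate (Hz) for a monophonic signal.

    fmin/fmax are hypothetical search limits chosen for this sketch.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()  # remove any DC offset
    # Full autocorrelation; keep only the non-negative lags.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Only search lags that correspond to plausible pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag

# Usage: a synthetic 200 Hz sine sampled at 8 kHz.
sr = 8000
t = np.arange(2048) / sr
estimate = estimate_pitch_autocorr(np.sin(2 * np.pi * 200.0 * t), sr)
```

On a clean sine this behaves well; on real recordings the strongest lag is frequently a multiple of the true period, which is where the half-pitch results come from.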

Some people approach the problem of key recognition as a matter of doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. Even if you do have a monophonic sequence of pitches, it's still not a clear-cut method of determining the key: how do you deal with chromatic notes, for instance, or determine whether it's major or minor? So you'd need to use a method similar to Krumhansl's key finding algorithm.
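To make the Krumhansl-style approach a little more concrete, here is a hedged sketch of the commonly described profile-correlation variant (often called Krumhansl-Schmuckler). It assumes you already have a 12-element vector saying how much each pitch class (C, C#, ... B) occurs, for example total note durations. The function name `estimate_key` and the input format are assumptions for this example; the two profiles are the published Krumhansl-Kessler ratings.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Krumhansl-Kessler key profiles, index 0 = tonic.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(pitch_class_weights):
    """Return (key name, correlation) for the best-matching of 24 keys.

    pitch_class_weights: 12 values, index 0 = C, 1 = C#, ... 11 = B.
    """
    weights = np.asarray(pitch_class_weights, dtype=float)
    best_name, best_r = None, -2.0
    for tonic in range(12):
        for mode, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
            # Rotate so that the tonic weight lands on index `tonic`.
            rotated = np.roll(profile, tonic)
            r = np.corrcoef(weights, rotated)[0, 1]
            if r > best_r:
                best_name, best_r = f"{NOTE_NAMES[tonic]} {mode}", r
    return best_name, best_r

# Usage: a C major scale with extra weight on C, E and G.
weights = [3, 0, 1, 0, 2, 1, 0, 2, 0, 1, 0, 1]
key_name, correlation = estimate_key(weights)
```

Chromatic notes simply add a little weight to bins that the profiles rate low, so they lower the correlation rather than break the method, and major versus minor falls out of which profile fits better.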

So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you're going to have a rich spectral soup of many sinusoids playing at once. Each individual note is comprised of multiple harmonics of a fundamental frequency, so A (at 440Hz) will be comprised of sinusoids at 440, 880, 1320... Furthermore, if you play an E (see this diagram for pitches) then that is 659.25Hz, which is almost one and a half times that of A (actually 1.498). This means that every 3rd harmonic of A nearly coincides with every 2nd harmonic of E. This is the reason that chords sound pleasant: they share harmonics. (As an aside, the whole reason that western harmony works is due to the quirk of fate that the twelfth root of 2 to the power 7 is nearly 1.5.)
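The arithmetic behind that aside is easy to check. This small Python sketch assumes twelve-tone equal temperament with A at 440Hz; the helper name `equal_tempered_ratio` is made up for the example.

```python
def equal_tempered_ratio(semitones):
    """Frequency ratio of an interval of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

fifth = equal_tempered_ratio(7)      # about 1.498, close to the pure ratio 3/2
a = 440.0
e_above_a = a * fifth                # about 659.26 Hz

# The 3rd harmonic of A and the 2nd harmonic of E land within about 1.5 Hz
# of each other, which is why they are treated as coinciding.
third_harmonic_of_a = 3 * a          # 1320.0 Hz
second_harmonic_of_e = 2 * e_above_a # about 1318.5 Hz
```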

If you look beyond this interval of a 5th to major, minor and other chords, you'll find other ratios. I think that many key finding techniques will enumerate these ratios and then fill a histogram for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760, 1977. For B5 it'll be 494, 988, 741, etc. So create a frequency histogram, and for every sinusoidal peak in the signal (e.g. from the FFT power spectrum) increment the matching histogram entry. Then for each key A-G tally up the bins in your histogram, and the key with the most entries is most likely to be yours.
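One way to sketch that histogram idea in Python is shown below. It is a simplification rather than a description of any particular published technique: it assumes the spectral peaks have already been extracted, folds every peak onto the nearest equal-tempered pitch class (A = 440Hz), and scores only root-plus-fifth "power chords" such as A5. The function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4=440.0):
    """Nearest equal-tempered pitch class (0 = C ... 11 = B) for a frequency."""
    midi = 69 + 12 * math.log2(freq_hz / a4)  # MIDI convention: A4 = 69
    return round(midi) % 12

def peak_histogram(peak_freqs):
    """Count how many spectral peaks fall on each of the 12 pitch classes."""
    hist = [0] * 12
    for f in peak_freqs:
        hist[pitch_class(f)] += 1
    return hist

def best_power_chord(hist):
    """Score each root by its own bin plus the bin a fifth (7 semitones) above."""
    scores = [hist[root] + hist[(root + 7) % 12] for root in range(12)]
    return NOTE_NAMES[scores.index(max(scores))] + "5"

# Usage with the peak lists quoted above.
a5 = best_power_chord(peak_histogram([440, 880, 659, 1320, 1760, 1977]))
b5 = best_power_chord(peak_histogram([494, 988, 741]))
```

Weighting each increment by the peak's magnitude instead of counting every peak as 1, and scoring full major and minor triads, would be natural next steps.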

That's just a very simple approach, but it may be enough to find the key of a strummed or sustained chord. You'd also have to chop the signal into small intervals (e.g. 20ms) and analyse each one to build up a more robust estimate.
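The chopping step might look something like the following sketch, which splits a signal into non-overlapping 20ms frames, applies a Hann window, and reports the strongest FFT bin per frame. The function names, the choice of window and the lack of overlap are all assumptions for the example. Note that a 20ms frame gives a frequency resolution of only 1 / 0.02 = 50Hz, so in practice longer or overlapping frames may be needed to separate low notes.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0):
    """Split a 1-D signal into non-overlapping frames; drops the leftover tail."""
    x = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def dominant_frequencies(signal, sample_rate, frame_ms=20.0):
    """Frequency (Hz) of the strongest FFT bin in each frame."""
    frames = frame_signal(signal, sample_rate, frame_ms)
    window = np.hanning(frames.shape[1])  # reduce spectral leakage
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    return freqs[np.argmax(spectra, axis=1)]

# Usage: one second of a 400 Hz sine at 8 kHz gives 50 frames.
sr = 8000
t = np.arange(sr) / sr
per_frame = dominant_frequencies(np.sin(2 * np.pi * 400.0 * t), sr)
overall = float(np.median(per_frame))
```

Combining the per-frame results with a median, or by accumulating them into one histogram, is what makes the overall estimate less sensitive to a few bad frames.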


If you want to experiment then I'd suggest downloading a package like Octave or CLAM, which makes it easier to visualise audio data and run FFTs and other operations.

Other useful links:

  • My PhD thesis on some aspects of pitch recognition -- the maths is a bit heavy going but chapter 2 is (I hope) quite an accessible introduction to the different approaches of modelling musical audio
  • http://en.wikipedia.org/wiki/Auditory_scene_analysis -- Bregman's Auditory Scene analysis which though not talking about music has some fascinating findings about how we perceive complex scenes
  • Dan Ellis has done some great papers in this and similar areas
  • Keith Martin has some interesting approaches
