Note onset detection


Question

I am developing a system as an aid to musicians performing transcription. The aim is to perform automatic music transcription (it does not have to be perfect, as the user will correct glitches / mistakes later) on a single instrument monophonic recording. Does anyone here have experience in automatic music transcription? Or digital signal processing in general? Help from anyone is greatly appreciated no matter what your background.

So far I have investigated the use of the Fast Fourier Transform for pitch detection, and a number of tests in both MATLAB and my own Java test programs have shown it to be fast and accurate enough for my needs. Another element of the task that will need to be tackled is the display of the produced MIDI data in sheet music form, but this is something I am not concerned with right now.

In brief, what I am looking for is a good method for note onset detection, i.e. the position in the signal where a new note begins. As slow onsets can be quite difficult to detect properly, I will initially be using the system with piano recordings. This is also partially due to the fact I play piano and should be in a better position to obtain suitable recordings for testing. As stated above, early versions of this system will be used for simple monophonic recordings, possibly progressing later to more complex input depending on progress made in the coming weeks.

Recommended answer

Here is a graphic that illustrates the threshold approach to note onset detection:

This image shows a typical WAV file with three discrete notes played in succession. The red line represents a chosen signal threshold, and the blue lines represent note start positions returned by a simple algorithm that marks a start when the signal level crosses the threshold.

As the image shows, selecting a proper absolute threshold is difficult. In this case, the first note is picked up fine, the second note is missed completely, and the third note is (barely) picked up, but very late. In general, a low threshold causes you to pick up phantom notes, while raising it causes you to miss notes. One solution to this problem is to use a relative threshold that triggers a start if the signal increases by a certain percentage over a certain time, but this has problems of its own.
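To make the relative-threshold idea concrete, here is a minimal sketch in Java (the language the asker mentions using for test programs). It is only one possible reading of "increases by a certain percentage over a certain time": it compares the mean absolute level of each fixed-size window with the window before it. The class name, method name, and the `window`, `ratio` and `floor` parameters are all made up for illustration, and `floor` (a minimum level below which rises are ignored) is my own assumption to keep near-silence from triggering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class RelativeThresholdSketch {
    /**
     * Returns the start index of every window whose mean absolute level
     * is above `floor` and more than `ratio` times the previous window's level.
     */
    static List<Integer> detectOnsets(short[] samples, int window, double ratio, double floor) {
        List<Integer> onsets = new ArrayList<>();
        double previous = 0.0;   // level of the preceding window (silence before the data)
        boolean rising = false;  // true while consecutive windows keep jumping
        for (int start = 0; start + window <= samples.length; start += window) {
            double sum = 0.0;
            for (int i = start; i < start + window; i++) {
                sum += Math.abs((int) samples[i]); // widen to int so -32768 is handled
            }
            double level = sum / window;
            boolean jump = level > floor && level > previous * ratio;
            if (jump && !rising) {
                onsets.add(start); // report only the first window of a continuing rise
            }
            rising = jump;
            previous = level;
        }
        return onsets;
    }
}
```

A call such as `RelativeThresholdSketch.detectOnsets(samples, 1024, 1.5, 200.0)` would flag windows that are at least 50% louder than their predecessor. The resolution is limited to the window size, and a loud note decaying into a quieter new note can still be missed, which is one of the "problems of its own".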

A simpler solution is to use the somewhat-counterintuitively named compression (not MP3 compression - that's something else entirely) on your wave file first. Compression essentially flattens the spikes in your audio data and then amplifies everything so that more of the audio is near the maximum values. The effect on the above sample would look like this (which shows why the name "compression" appears to make no sense - on audio equipment it's usually labelled "loudness"):

After compression, the absolute threshold approach will work much better (although it's easy to over-compress and start picking up fictional note starts, the same effect as lowering the threshold). There are a lot of wave editors out there that do a good job of compression, and it's better to let them handle this task - you'll probably need to do a fair amount of work "cleaning up" your wave files before detecting notes in them anyway.

In coding terms, a WAV file loaded into memory is essentially just an array of two-byte integers, where 0 represents no signal and 32,767 and -32,768 represent the peaks. In its simplest form, a threshold detection algorithm would just start at the first sample and read through the array until it finds a value greater than the threshold.

short threshold = 10000;
for (int i = 0; i < samples.Length; i++)
{
    // widen to int first: Math.Abs on a short throws for -32768
    if (Math.Abs((int)samples[i]) > threshold)
    {
        // here is one note onset point
    }
}

In practice this works horribly, since normal audio has all sorts of transient spikes above a given threshold. One solution is to use a running average signal strength (i.e. don't mark a start until the average of the last n samples is above the threshold).

short threshold = 10000;
int window_length = 100;
int running_total = 0;
// tally up the absolute values of the first window_length samples
// (signed samples would average out to roughly zero)
for (int i = 0; i < window_length; i++)
{
    running_total += Math.Abs((int)samples[i]);
}
// calculate moving average of the signal strength
for (int i = window_length; i < samples.Length; i++)
{
    // remove oldest sample and add current
    running_total -= Math.Abs((int)samples[i - window_length]);
    running_total += Math.Abs((int)samples[i]);
    int moving_average = running_total / window_length;
    if (moving_average > threshold)
    {
        // here is one note onset point (this fires for every sample
        // above the threshold; keep only the first of each run)
        int onset_point = i - (window_length / 2);
    }
}

All of this requires much tweaking and playing around with settings to get it to find the start positions of a WAV file accurately, and usually what works for one file will not work very well on another. This is a very difficult and not-perfectly-solved problem domain you've chosen, but I think it's cool that you're tackling it.

Update: this graphic shows a detail of note detection I left out, namely detecting when the note ends:

The yellow line represents the off-threshold. Once the algorithm has detected a note start, it assumes the note continues until the running average signal strength drops below this value (shown here by the purple lines). This is, of course, another source of difficulties, as is the case where two or more notes overlap (polyphony).
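The on/off threshold pair described above could be sketched roughly like this in Java. This is an illustration under my own assumptions, not the code behind the graphic: the level is a running mean of absolute sample values, and the class name, method name and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class NoteSegmenterSketch {
    /**
     * Returns {start, end} sample index pairs. A note starts when the running
     * mean absolute level rises above onThreshold and ends when it drops
     * below offThreshold (which should be lower than onThreshold).
     */
    static List<int[]> findNotes(short[] samples, int window, double onThreshold, double offThreshold) {
        List<int[]> notes = new ArrayList<>();
        long runningTotal = 0;
        int noteStart = -1; // -1 means no note is currently sounding
        for (int i = 0; i < samples.length; i++) {
            runningTotal += Math.abs((int) samples[i]);
            if (i >= window) {
                runningTotal -= Math.abs((int) samples[i - window]); // drop the oldest sample
            }
            if (i < window - 1) {
                continue; // the window is not full yet
            }
            double level = (double) runningTotal / window;
            if (noteStart < 0 && level > onThreshold) {
                noteStart = i;
            } else if (noteStart >= 0 && level < offThreshold) {
                notes.add(new int[] {noteStart, i});
                noteStart = -1;
            }
        }
        if (noteStart >= 0) {
            notes.add(new int[] {noteStart, samples.length}); // still sounding at end of data
        }
        return notes;
    }
}
```

Using two thresholds gives some hysteresis: a note that wobbles around the on-threshold is not split into several short notes. It does nothing for overlapping notes, which need a different approach.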

Once you've detected the start and stop points of each note, you can now analyze each slice of WAV file data to determine the pitches.

Update 2: I just read your updated question. Pitch-detection through auto-correlation is much easier to implement than FFT if you're writing your own from scratch, but if you've already checked out and used a pre-built FFT library, you're better off using it for sure. Once you've identified the start and stop positions of each note (and included some padding at the beginning and end for the missed attack and release portions), you can now pull out each slice of audio data and pass it to an FFT function to determine the pitch.
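For anyone who does want to try auto-correlation, a bare-bones version might look like the Java sketch below. It is the naive O(n × lags) form under assumptions of my own: a single slice, a search range given by the hypothetical `minHz` and `maxHz` parameters, and the lag with the highest mean product taken as the period. The class and method names are made up.

```java
// Hypothetical helper, not part of the original answer.
final class AutocorrelationPitchSketch {
    /**
     * Estimates the fundamental frequency in Hz of samples[from..to) by picking
     * the lag (in samples) with the largest autocorrelation between the lags
     * that correspond to maxHz and minHz. Returns -1 if the slice is too short
     * or no positive correlation is found.
     */
    static double estimateHz(short[] samples, int from, int to,
                             double sampleRate, double minHz, double maxHz) {
        int minLag = Math.max(1, (int) Math.floor(sampleRate / maxHz));
        int maxLag = (int) Math.ceil(sampleRate / minHz);
        int length = to - from;
        if (length <= maxLag) {
            return -1.0;
        }
        int bestLag = -1;
        double bestScore = 0.0;
        for (int lag = minLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int i = from; i + lag < to; i++) {
                sum += (double) samples[i] * samples[i + lag];
            }
            double score = sum / (length - lag); // mean, so long and short lags compare fairly
            if (score > bestScore) {
                bestScore = score;
                bestLag = lag;
            }
        }
        return bestLag > 0 ? sampleRate / bestLag : -1.0;
    }
}
```

Multiples of the true period correlate almost as well as the period itself, so a wide search range invites octave errors; keeping `minHz` and `maxHz` close to the instrument's range helps. The frequency resolution is also limited to whole-sample lags.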

One important point here is not to use a slice of the compressed audio data, but rather to use a slice of the original, unmodified data. The compression process distorts the audio and may produce an inaccurate pitch reading.

One last point about note attack times is that it may be less of a problem than you think. Often in music an instrument with a slow attack (like a soft synth) will begin a note earlier than a sharp attack instrument (like a piano) and both notes will sound as if they're starting at the same time. If you're playing instruments in this manner, the algorithm will pick up the same start time for both kinds of instruments, which is good from a WAV-to-MIDI perspective.

Last update (I hope): Forget what I said about including some padding samples from the early attack part of each note - I forgot this is actually a bad idea for pitch detection. The attack portions of many instruments (especially piano and other percussive-type instruments) contain transients that aren't multiples of the fundamental pitch, and will tend to screw up pitch detection. You actually want to start each slice a little after the attack for this reason.
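As a small illustration of starting each slice a little after the attack, the Java helper below shifts the start of the analysis range by a fixed number of milliseconds. How long to skip depends on the instrument; the `skipMs` parameter and all names here are hypothetical, and any particular value (say 50 ms) is a guess to tune, not a recommendation from measurements.

```java
// Hypothetical helper, not part of the original answer.
final class PitchSliceSketch {
    /**
     * Returns {from, to} sample bounds for pitch analysis of a note that was
     * detected at `onset` and ends at `noteEnd`, skipping `skipMs` milliseconds
     * of attack. For very short notes the slice may come back empty.
     */
    static int[] pitchBounds(int onset, int noteEnd, double sampleRate, double skipMs) {
        int skip = (int) Math.round(sampleRate * skipMs / 1000.0);
        int from = Math.min(onset + skip, noteEnd); // never run past the end of the note
        return new int[] {from, noteEnd};
    }
}
```

The returned bounds index into the original, uncompressed sample array, which is what the pitch detector should be given.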

Oh, and kind of important: the term "compression" here does not refer to MP3-style compression.

Update again: here is a simple function that does non-dynamic compression:

public void StaticCompress(short[] samples, float param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        int sign = (samples[i] < 0) ? -1 : 1;
        float norm = ABS(samples[i] / 32768); // NOT short.MaxValue
        norm = 1.0 - POW(1.0 - norm, param);
        samples[i] = 32768 * norm * sign;
    }
}

When param = 1.0, this function will have no effect on the audio. Larger param values (2.0 is good, which will square the normalized difference between each sample and the max peak value) will produce more compression and a louder overall (but crappy) sound. Values under 1.0 will produce an expansion effect.

One other probably obvious point: you should record the music in a small, non-echoic room since echoes are often picked up by this algorithm as phantom notes.

Update: here is a version of StaticCompress that will compile in C# and explicitly casts everything. This returns the expected result:

public void StaticCompress(short[] samples, double param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        Compress(ref samples[i], param);
    }
}

public void Compress(ref short orig, double param)
{
    double sign = 1;
    if (orig < 0)
    {
        sign = -1;
    }
    // 32768 is max abs value of a short. best practice is to pre-
    // normalize data or use peak value in place of 32768
    double norm = Math.Abs((double)orig / 32768.0);
    norm = 1.0 - Math.Pow(1.0 - norm, param);
    orig = (short)(32768.0 * norm * sign); // should round before cast,
        // but won't affect note onset detection
}

Sorry, my knowledge score on Matlab is 0. If you posted another question on why your Matlab function doesn't work as expected it would get answered (just not by me).
