keras: how to aggregate frame-level predictions into a song-level prediction

Question

I am doing song genre classification. I have chopped each song into short frames (5 s) and generated a spectrogram for each frame as the input feature for a neural network. Each frame carries the genre label of its song.

The data looks like this:

   name         label   feature
   ....
   song_i_frame1 label   feature_vector_frame1
   song_i_frame2 label   feature_vector_frame2
   ...
   song_i_framek label   feature_vector_framek
   ...

I can get a prediction accuracy for each frame from Keras with no problem. But currently I do not know how to aggregate the prediction results from frame level to song level with majority voting, because the frame names are lost once the data is fed into the Keras model.

How can I retain the name attached to each label (for example, song_i_frame1) in the Keras outputs, so that I can form an aggregate prediction for the song via majority voting? Or are there other methods to aggregate to a song-level prediction?

I googled around but cannot find an answer to this and would appreciate any insight.

Recommended answer

In the dataset each label might be a name (e.g. 'rock'). To use it with a neural network, it needs to be transformed to an integer (e.g. 2) and then to a one-hot encoding (e.g. [0,0,1]). So 'rock' == 2 == [0,0,1]. Your output predictions will be in this one-hot-encoded form: [ 0.1, 0.1, 0.9 ] means that class 2 was predicted, [ 0.9, 0.1, 0.1 ] means class 0, etc. To do this in a reversible way, use sklearn.preprocessing.LabelBinarizer.
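As a rough illustration of that round trip, here is the same mapping done by hand in plain numpy. The three-genre label set is made up for the example so that 'rock' lands on index 2; it is only a sketch of the idea, not what LabelBinarizer does internally.

```python
import numpy as np

# Hypothetical label set, sorted alphabetically so that 'rock' gets index 2.
classes = np.array(['blues', 'jazz', 'rock'])

# Name -> integer -> one-hot vector.
index = int(np.where(classes == 'rock')[0][0])
one_hot = np.eye(len(classes), dtype=int)[index]

# One frame's predicted probabilities -> most likely class -> name.
prediction = np.array([0.1, 0.1, 0.9])
decoded = classes[prediction.argmax()]

print(index, one_hot, decoded)
```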

There are several ways of combining frame predictions into an overall prediction. The most common are, in increasing order of complexity:

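  • Majority voting over the probabilities
  • Mean/average voting over the probabilities
  • Averaging the log-odds of the probabilities
  • Sequence model on the log-odds of the probabilities
  • Multiple-instance learning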

Below is an example of the first three.

import numpy
from sklearn.preprocessing import LabelBinarizer

labels = [ 'rock', 'jazz', 'blues', 'metal' ] 

binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)

print('labels\n', '\n'.join(labels))
print('y\n', y)

# Outputs from frame-based classifier. 
# input would be all the frames in one song
# frame_predictions = model.predict(frames)
frame_predictions = numpy.array([
    [ 0.5, 0.2, 0.3, 0.9 ],
    [ 0.9, 0.2, 0.3, 0.3 ],
    [ 0.5, 0.2, 0.3, 0.7 ],
    [ 0.1, 0.2, 0.3, 0.5 ],
    [ 0.9, 0.2, 0.3, 0.4 ],
])

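def vote_majority(p):
    # Each frame votes for its most likely class; minlength keeps one
    # slot per class even when the last classes receive no votes
    voted = numpy.bincount(numpy.argmax(p, axis=1), minlength=p.shape[1])
    normalized = voted / p.shape[0]
    return normalized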

def vote_average(p):
    return numpy.mean(p, axis=0)

def vote_average_logits(p):
    # Average in log-odds space over the frames (axis 0), one value per class
    logits = numpy.log(p / (1 - p))
    avg = numpy.mean(logits, axis=0)
    p = 1/(1+ numpy.exp(-avg))
    return p


# Print probabilities with at most three decimals
numpy.set_printoptions(precision=3)

maj = vote_majority(frame_predictions)
mean = vote_average(frame_predictions)
mean_logits = vote_average_logits(frame_predictions)

genre_maj = binarizer.inverse_transform(numpy.array([maj]))
genre_mean = binarizer.inverse_transform(numpy.array([mean]))
genre_mean_logits = binarizer.inverse_transform(numpy.array([mean_logits]))
print('majority voting:', maj, genre_maj)
print('mean voting:', mean, genre_mean)
print('mean logits voting:', mean_logits, genre_mean_logits)

Output

labels:
 rock
 jazz
 blues
 metal
y:
 [[0 0 0 1]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]]
majority voting: [0.4 0.  0.  0.6] ['rock']
mean voting: [0.58 0.2  0.3  0.56] ['blues']
mean logits voting: [0.608 0.2   0.3   0.589] ['blues']
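On the question of keeping the frame names: Keras returns predictions in the same order as the input rows, so the names do not have to pass through the model at all. One possible approach is to keep a parallel list of names and group the prediction rows by song afterwards. The sketch below assumes the '<song>_frame<k>' naming from the question; the names, the song_id helper and the probabilities are invented for illustration, and frame_probs stands in for the output of model.predict.

```python
import numpy as np
from collections import defaultdict

# Hypothetical frame names, in the same order as the rows given to model.predict().
frame_names = ['song_a_frame1', 'song_a_frame2', 'song_a_frame3',
               'song_b_frame1', 'song_b_frame2']

# Stand-in for model.predict(features): one row of class probabilities per frame.
frame_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.5, 0.3],
])

def song_id(frame_name):
    # 'song_a_frame1' -> 'song_a'
    return frame_name.rsplit('_frame', 1)[0]

# Collect the row indices that belong to each song.
rows_by_song = defaultdict(list)
for row, name in enumerate(frame_names):
    rows_by_song[song_id(name)].append(row)

# Aggregate each song's frames with majority voting and with mean voting.
song_predictions = {}
for song, rows in rows_by_song.items():
    p = frame_probs[rows]
    votes = np.bincount(p.argmax(axis=1), minlength=p.shape[1])
    song_predictions[song] = {
        'majority': int(votes.argmax()),
        'mean': int(p.mean(axis=0).argmax()),
    }

print(song_predictions)
```

Note that song_b has one vote each for classes 1 and 2. argmax resolves the tie in favour of the lower index, so majority voting and mean voting disagree for that song; with few frames per song, mean voting avoids this kind of tie.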

A simple improvement over averaging probabilities is to compute the logits (log-odds) of the probabilities and average those. This accounts more properly for frames whose class is very likely or very unlikely. It can be seen as Naive Bayes: computing the posterior probability under the assumption that the frames are independent.
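To make the difference concrete, the sketch below compares the two averages for a single class over three frames. It also clips the probabilities before taking the log, since a probability of exactly 0 or 1 would otherwise give an infinite logit. The function name average_log_odds and the eps value are choices made for this example.

```python
import numpy as np

def average_log_odds(p, eps=1e-7):
    # Clip so that log(p / (1 - p)) stays finite for probabilities of exactly 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    logits = np.log(p / (1 - p))
    # Average over the frames (axis 0), then map back to a probability.
    return 1 / (1 + np.exp(-logits.mean(axis=0)))

# One class, three frames: one very confident frame and two undecided ones.
p = np.array([[0.99], [0.5], [0.5]])

mean_p = p.mean(axis=0)        # arithmetic mean of the probabilities
logit_p = average_log_odds(p)  # average taken in log-odds space

print(mean_p, logit_p)
```

The plain mean comes out at about 0.66, while the log-odds average gives about 0.82: the confident frame carries more weight in log-odds space.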

One can also perform the voting with a classifier trained on the frame-wise predictions, though this is not so commonly done and gets complicated when the input length varies. A simple sequence model can be used, e.g. a Recurrent Neural Network (RNN) or a Hidden Markov Model (HMM).

Another alternative is to use Multiple-Instance Learning with GlobalAveragePooling over the frame-based classifications, to learn on whole songs at once.
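A minimal sketch of how that could look in Keras is given below. It assumes every song is cut into the same number of frames and that each frame is a flat feature vector; the sizes and the tiny frame classifier are placeholders, not the setup from the question. The frame model is applied to every frame with TimeDistributed, and GlobalAveragePooling1D averages the per-frame class probabilities into one prediction per song, so the loss is computed against the song label.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder sizes: frames per song, features per frame, number of genres.
n_frames, n_features, n_classes = 10, 128, 4

# Frame-level classifier: one feature vector in, class probabilities out.
frame_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(n_classes, activation='softmax'),
])

# Song-level model: apply the frame classifier to every frame,
# then average the frame probabilities over the frame axis.
song_input = keras.Input(shape=(n_frames, n_features))
frame_probs = layers.TimeDistributed(frame_model)(song_input)
song_probs = layers.GlobalAveragePooling1D()(frame_probs)

song_model = keras.Model(song_input, song_probs)
song_model.compile(optimizer='adam', loss='categorical_crossentropy')
```

Training then uses one row per song, i.e. inputs of shape (n_songs, n_frames, n_features) and one-hot song labels of shape (n_songs, n_classes).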
