keras: how to write a customized loss function to aggregate frame-level predictions into a song-level prediction


Question

I am doing song genre classification (2 classes). For each song, I have chopped it into small frames (5 s) to generate MFCCs as input features for a neural network, and each frame carries the genre label of its song.

The data looks like this:

 name         label   feature
 ....
 song_i_frame1 label   feature_vector_frame1
 song_i_frame2 label   feature_vector_frame2
 ...
 song_i_framek label   feature_vector_framek
 ...

I know that I can randomly pick, say, 80% of the songs (all of their frames) as training data and use the rest for testing. But the way I currently build X_train is at the frame level, and the binary cross-entropy loss is also defined at the frame level. I am wondering how to customize the loss function so that it is minimized over an aggregation of the frame-level predictions (e.g. a majority vote over every frame prediction of a song).
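
For concreteness, the aggregation I have in mind looks like the post-hoc majority vote sketched below (the column names are illustrative, not my real code); I would like something equivalent inside the loss function:

import numpy as np
import pandas as pd

# Frame-level sigmoid outputs, one row per frame (illustrative data)
frames = pd.DataFrame({
    'name': ['song_i', 'song_i', 'song_i', 'song_j', 'song_j', 'song_j'],
    'pred': [0.9, 0.8, 0.3, 0.2, 0.6, 0.1],
})

# Majority vote: threshold each frame, then take the most common vote per song
frames['vote'] = (frames['pred'] > 0.5).astype(int)
song_pred = frames.groupby('name')['vote'].mean().round().astype(int)
print(song_pred)  # song_i -> 1, song_j -> 0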

Currently, what I have is:

model_19mfcc = Model(input_shape=(X_train19.shape[1], X_train19.shape[2]))
model_19mfcc.compile(loss='binary_crossentropy', optimizer='RMSProp', metrics=['accuracy'])
history_fit = model_19mfcc.fit(X_train19, y_train, validation_split=0.25,
                               batch_size=1800 // 50,  # batch_size must be an int, not a float
                               epochs=200)

Also, when I feed the training and testing data into keras, the corresponding IDs (names) of the samples are lost. Is keeping the data (name, label, and feature) in a separate pandas DataFrame and matching the predictions back from keras a good practice? Or are there other good alternatives?
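
That is, bookkeeping roughly like this (simplified, with made-up names):

import numpy as np
import pandas as pd

# Illustrative frame-level table keeping name, label and feature together
df = pd.DataFrame({
    'name': ['song_i_frame1', 'song_i_frame2'],
    'label': [1, 1],
    'feature': [np.zeros(13), np.ones(13)],
})

meta = df[['name', 'label']].copy()   # metadata stays in pandas
X = np.stack(df['feature'].values)    # only the features go into keras
# model.fit(X, df['label'].values, ...)
# Predictions come back in the same row order, so they can be matched by position:
# meta['pred'] = model.predict(X).ravel()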

Thanks in advance!

Answer

A customized loss function is usually not needed for genre classification. Instead, a combined model where a song is split into multiple prediction windows can be set up with Multiple Instance Learning (MIL).

MIL is a supervised learning approach where the label is not attached to each independent sample (instance), but instead to a "bag" (unordered set) of instances. In your case, an instance is each 5-second window of MFCC features, and the bag is the entire song.

In Keras we use the TimeDistributed layer to execute our window model for all windows. Then we combine the results using GlobalAveragePooling1D, effectively implementing mean voting across the windows. This is more easily differentiable than a majority vote.
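
To see why, compare the two aggregations on some example window probabilities (an illustration only, not part of the model):

import numpy as np

frame_probs = np.array([0.9, 0.8, 0.3, 0.6])  # per-window sigmoid outputs

# Majority vote: a hard threshold per window, then a count. The step
# function has zero gradient almost everywhere, so the loss cannot be
# backpropagated through it directly.
majority = int((frame_probs > 0.5).mean() > 0.5)

# Mean vote: a plain average, smooth in every window probability.
# This is what GlobalAveragePooling1D computes across the windows.
mean_vote = frame_probs.mean()

print(majority, mean_vote)  # 1 0.65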

Below is a runnable example:

import math

import keras
import librosa
import pandas
import numpy
import sklearn

def window_model(n_bands, n_frames, n_classes, hidden=32):
    from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

    out_units = 1 if n_classes == 2 else n_classes
    out_activation = 'sigmoid' if n_classes == 2 else 'softmax'

    shape = (n_bands, n_frames, 1)

    # Basic CNN model
    # An MLP could also be used, but may need to reshape on input and output
    model = keras.Sequential([
        Conv2D(16, (3,3), input_shape=shape),
        MaxPooling2D((2,3)),
        Conv2D(16, (3,3)),
        MaxPooling2D((2,2)),
        Flatten(),
        Dense(hidden, activation='relu'),
        Dense(hidden, activation='relu'),
        Dense(out_units, activation=out_activation),
    ])
    return model

def song_model(n_bands, n_frames, n_windows, n_classes=3):
    from keras.layers import TimeDistributed, GlobalAveragePooling1D

    # Create the frame-wise model, will be reused across all frames
    base = window_model(n_bands, n_frames, n_classes)
    # GlobalAveragePooling1D expects a 'channel' dimension at end
    shape = (n_windows, n_bands, n_frames, 1)

    print('Frame model')
    base.summary()

    model = keras.Sequential([
        TimeDistributed(base, input_shape=shape),
        GlobalAveragePooling1D(),
    ])

    print('Song model')
    model.summary()

    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['acc'])
    return model


def extract_features(path, sample_rate, n_bands, hop_length, n_frames, window_length, song_length):
    # melspectrogram might perform better with CNNs
    from librosa.feature import mfcc

    # Load a fixed length section of sound
    # Might need to pad if some songs are too short
    y, sr = librosa.load(path, sr=sample_rate, offset=0, duration=song_length)
    assert sr == sample_rate, sr
    _song_length = len(y)/sample_rate

    assert _song_length == song_length, _song_length

    # Split into windows
    window_samples = int(sample_rate * window_length)
    window_hop = window_samples//2 # use 50% overlap
    windows = librosa.util.frame(y, frame_length=window_samples, hop_length=window_hop)

    # Calculate features for each window
    features = []
    for w in range(windows.shape[1]):
        win = windows[:, w]
        f = mfcc(y=win, sr=sample_rate, n_mfcc=n_bands,
                 hop_length=hop_length, n_fft=2*hop_length)
        f = numpy.expand_dims(f, -1) # add channels dimension 
        features.append(f)

    features = numpy.stack(features)
    return features

def main():

    # Settings for our model
    n_bands = 13 # MFCCs
    sample_rate = 22050
    hop_length = 512
    window_length = 5.0
    song_length_max = 1.0*60
    n_frames = math.ceil(window_length / (hop_length/sample_rate))
    n_windows = math.floor(song_length_max / (window_length/2))-1

    model = song_model(n_bands, n_frames, n_windows)

    # Generate some example data
    ex = librosa.util.example_audio_file()
    examples = 8
    numpy.random.seed(2)
    songs = pandas.DataFrame({
        'path': [ex] * examples,
        'genre': numpy.random.choice([ 'rock', 'metal', 'blues' ], size=examples),
    })
    assert len(songs.genre.unique()) == 3

    print('Song data')
    print(songs)

    def get_features(path):
        f = extract_features(path, sample_rate, n_bands,
                    hop_length, n_frames, window_length, song_length_max)
        return f

    from sklearn.preprocessing import LabelBinarizer

    binarizer = LabelBinarizer()
    y = binarizer.fit_transform(songs.genre.values)
    print('y', y.shape, y)

    features = numpy.stack([ get_features(p) for p in songs.path ])
    print('features', features.shape)

    model.fit(features, y) 


if __name__ == '__main__':
    main()

The example outputs the inner and combined model summaries:

Frame model
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 11, 214, 16)       160       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 71, 16)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 69, 16)         2320      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 1, 34, 16)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 544)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                17440     
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 99        
=================================================================
Total params: 21,075
Trainable params: 21,075
Non-trainable params: 0

_________________________________________________________________
Song model
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, 23, 3)             21075     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 3)                 0         
=================================================================
Total params: 21,075
Trainable params: 21,075
Non-trainable params: 0
_________________________________________________________________

And the shape of the feature vector fed to the model:

features (8, 23, 13, 216, 1)

8 songs, 23 windows each, with 13 MFCC bands and 216 frames per window. The fifth dimension of size 1 is the channels dimension that keeps Keras (Conv2D) happy. Both counts follow from the settings in main(): floor(60 / 2.5) - 1 = 23 half-overlapping windows per 60 s clip, and ceil(5.0 / (512 / 22050)) = 216 frames per 5 s window.
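
This also answers the bookkeeping part of the question: the song model consumes one row per song, so the names can stay in the songs DataFrame and predictions can be matched back by row order. A minimal sketch, reusing model, features, songs and binarizer from main() above:

# One probability row per song, in the same order as the songs dataframe
pred = model.predict(features)  # shape (n_songs, n_classes)
songs['predicted'] = binarizer.inverse_transform(pred)
print(songs[['genre', 'predicted']])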
