keras: how to write a customized loss function to aggregate frame-level predictions into a song-level prediction


Problem description

I am doing song genre classification (2 classes). For each song, I have chopped it into small frames (5 s) to generate MFCCs as input features for a neural network, and each frame carries its song's genre label.

The data looks like this:

 name         label   feature
 ....
 song_i_frame1 label   feature_vector_frame1
 song_i_frame2 label   feature_vector_frame2
 ...
 song_i_framek label   feature_vector_framek
 ...

I know that I can randomly pick, say, 80% of the songs (all of their frames) as training data and use the rest for testing. But the way X_train is currently written, every sample is a single frame, and the binary cross-entropy loss function is defined at the frame level. I am wondering how I can customize the loss function so that it is minimized over an aggregation of the frame-level predictions (e.g. a majority vote over each song's frame predictions).
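For example, at evaluation time the aggregation I have in mind would look roughly like this (just a sketch with made-up frame_preds and song_ids arrays, not code I already have):

import numpy

# made-up per-frame sigmoid outputs and the song each frame belongs to
frame_preds = numpy.array([0.2, 0.8, 0.9, 0.4, 0.6])   # e.g. model.predict(X_test).ravel()
song_ids = numpy.array(['song_i', 'song_i', 'song_i', 'song_j', 'song_j'])

# majority vote: a song is class 1 if more than half of its frames are predicted as class 1
song_pred = {}
for song in numpy.unique(song_ids):
    votes = frame_preds[song_ids == song] > 0.5
    song_pred[song] = int(votes.mean() > 0.5)

print(song_pred)   # {'song_i': 1, 'song_j': 0}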

Currently, what I have is:

model_19mfcc = Model(input_shape=(X_train19.shape[1], X_train19.shape[2]))
model_19mfcc.compile(loss='binary_crossentropy', optimizer="RMSProp", metrics=["accuracy"])
history_fit = model_19mfcc.fit(X_train19, y_train, validation_split=0.25, batch_size=1800//50, epochs=200)

Also, when I feed the training and testing data into Keras, the corresponding ID (name) of each sample is lost. Is keeping the data (name, label, and feature) in a separate pandas DataFrame and matching the predictions from Keras back to it a good practice, or are there better alternatives?
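In other words, something like this sketch (with a hypothetical frame_df following the table layout above; the pred column would come from model_19mfcc.predict on the matching rows):

import pandas

# hypothetical frame-level table, same layout as above, with predictions written
# back in input order (predict() keeps the row order of its input)
frame_df = pandas.DataFrame({
    'name':  ['song_i_frame1', 'song_i_frame2', 'song_j_frame1', 'song_j_frame2'],
    'label': [1, 1, 0, 0],
    'pred':  [0.8, 0.6, 0.3, 0.7],
})

# recover the song name from the frame name and aggregate per song
frame_df['song'] = frame_df['name'].str.rsplit('_frame', n=1).str[0]
print(frame_df.groupby('song')[['label', 'pred']].mean())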

Thanks in advance!

Answer

A customized loss function is usually not needed for genre classification. A combined model, where a song is split into multiple prediction windows, can be set up with Multiple Instance Learning (MIL).

MIL is a supervised learning approach where the label is not attached to each independent sample (instance), but to a "bag" (unordered set) of instances. In your case the instance is each 5-second window of MFCC features, and the bag is the entire song.

In Keras we use the TimeDistributed layer to run the window model across all windows. We then combine the results with GlobalAveragePooling1D, effectively implementing mean voting across the windows. This is more easily differentiable than majority voting.

Below is a runnable example:

import math

import keras
import librosa
import pandas
import numpy
import sklearn

def window_model(n_bands, n_frames, n_classes, hidden=32):
    from keras.layers import Input, Dense, Flatten, Conv2D, MaxPooling2D

    out_units = 1 if n_classes == 2 else n_classes
    out_activation = 'sigmoid' if n_classes == 2 else 'softmax'

    shape = (n_bands, n_frames, 1)

    # Basic CNN model
    # An MLP could also be used, but may need to reshape on input and output
    model = keras.Sequential([
        Conv2D(16, (3,3), input_shape=shape),
        MaxPooling2D((2,3)),
        Conv2D(16, (3,3)),
        MaxPooling2D((2,2)),
        Flatten(),
        Dense(hidden, activation='relu'),
        Dense(hidden, activation='relu'),
        Dense(out_units, activation=out_activation),
    ])
    return model

def song_model(n_bands, n_frames, n_windows, n_classes=3):
    from keras.layers import Input, TimeDistributed, GlobalAveragePooling1D

    # Create the frame-wise model, will be reused across all frames
    base = window_model(n_bands, n_frames, n_classes)
    # GlobalAveragePooling1D expects a 'channel' dimension at end
    shape = (n_windows, n_bands, n_frames, 1)

    print('Frame model')
    base.summary()

    model = keras.Sequential([
        TimeDistributed(base, input_shape=shape),
        GlobalAveragePooling1D(),
    ])

    print('Song model')
    model.summary()

    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['acc'])
    return model


def extract_features(path, sample_rate, n_bands, hop_length, n_frames, window_length, song_length):
    # melspectrogram might perform better with CNNs
    from librosa.feature import mfcc

    # Load a fixed length section of sound
    # Might need to pad if some songs are too short
    y, sr = librosa.load(path, sr=sample_rate, offset=0, duration=song_length)
    assert sr == sample_rate, sr
    _song_length = len(y)/sample_rate

    assert _song_length == song_length, _song_length

    # Split into windows
    window_samples = int(sample_rate * window_length)
    window_hop = window_samples//2 # use 50% overlap
    windows = librosa.util.frame(y, frame_length=window_samples, hop_length=window_hop)

    # Calculate features for each window
    features = []
    for w in range(windows.shape[1]):
        win = windows[:, w]
        f = mfcc(y=win, sr=sample_rate, n_mfcc=n_bands,
                 hop_length=hop_length, n_fft=2*hop_length)
        f = numpy.expand_dims(f, -1) # add channels dimension 
        features.append(f)

    features = numpy.stack(features)
    return features

def main():

    # Settings for our model
    n_bands = 13 # MFCCs
    sample_rate = 22050
    hop_length = 512
    window_length = 5.0
    song_length_max = 1.0*60
    n_frames = math.ceil(window_length / (hop_length/sample_rate))
    n_windows = math.floor(song_length_max / (window_length/2))-1

    model = song_model(n_bands, n_frames, n_windows)

    # Generate some example data
    ex =  librosa.util.example_audio_file()
    examples = 8
    numpy.random.seed(2)
    songs = pandas.DataFrame({
        'path': [ex] * examples,
        'genre': numpy.random.choice([ 'rock', 'metal', 'blues' ], size=examples),
    })
    assert len(songs.genre.unique()) == 3

    print('Song data')
    print(songs)

    def get_features(path):
        f = extract_features(path, sample_rate, n_bands,
                    hop_length, n_frames, window_length, song_length_max)
        return f

    from sklearn.preprocessing import LabelBinarizer

    binarizer = LabelBinarizer()
    y = binarizer.fit_transform(songs.genre.values)
    print('y', y.shape, y)

    features = numpy.stack([ get_features(p) for p in songs.path ])
    print('features', features.shape)

    model.fit(features, y) 


if __name__ == '__main__':
    main()

The example outputs the inner and combined model summaries:

Frame model
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 11, 214, 16)       160       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 71, 16)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 69, 16)         2320      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 1, 34, 16)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 544)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                17440     
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 99        
=================================================================
Total params: 21,075
Trainable params: 21,075
Non-trainable params: 0

_________________________________________________________________
Song model
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, 23, 3)             21075     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 3)                 0         
=================================================================
Total params: 21,075
Trainable params: 21,075
Non-trainable params: 0
_________________________________________________________________

And the shape of the feature vector fed to the model:

features (8, 23, 13, 216, 1)

8 songs, 23 windows each, with 13 MFCC bands and 216 frames per window. The fifth dimension of size 1 is the channel dimension that Keras expects.
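For reference, the 23 and 216 follow directly from the constants defined in main() (this just re-computes them):

import math

sample_rate = 22050
hop_length = 512
window_length = 5.0        # seconds per window
song_length_max = 1.0*60   # seconds of audio used per song

# MFCC frames per window: one frame every hop_length samples
print(math.ceil(window_length / (hop_length/sample_rate)))       # 216
# 50%-overlapping windows that fit in the song: one every 2.5 s, minus the final partial one
print(math.floor(song_length_max / (window_length/2)) - 1)       # 23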
