Understanding CTC loss for speech recognition in Keras


Question

I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.

  1. What I think I understand (please correct me if I'm wrong!)

Broadly, the CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech), rather than decoding a whole block of elements directly (a word, for example).

Let's say we're feeding utterances of some sentences as MFCCs.

The goal in using CTC loss is to learn how to make each letter match the MFCC at each time step. Thus, the Dense+softmax output layer is composed of as many neurons as the number of elements needed for the composition of the sentences:

  • the letters (a, b, ..., z)
  • a blank token (-)
  • a space (_) and an end character (>)

Then, the softmax layer has 29 neurons (26 for the alphabet + some special characters).
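
For concreteness, here is a minimal sketch of one possible 29-token inventory (the exact characters and their order are an illustrative assumption, not something CTC fixes):

import string

# hypothetical inventory: 26 letters + space + end marker + CTC blank
ALPHABET = list(string.ascii_lowercase) + ['_', '>', '-']
char_to_index = {ch: i for i, ch in enumerate(ALPHABET)}

assert len(ALPHABET) == 29
# the blank '-' is deliberately last: Keras' ctc_batch_cost is built on
# tf.nn.ctc_loss, which reserves the last class index for the blank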

To implement it, I found that I can do something like this:

# CTC implementation from Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)



from keras.layers import Input, Bidirectional, LSTM, TimeDistributed, Dense, Lambda
from keras.models import Model

input_data = Input(shape=(1000, 20))
# let's say each MFCC is (1000 timesteps x 20 features)

# extra inputs consumed by the CTC lambda layer; max_string_length is a
# constant you choose (shapes follow the ctc_batch_cost documentation)
y_true = Input(shape=(max_string_length,))
input_length = Input(shape=(1,), dtype='int64')
label_length = Input(shape=(1,), dtype='int64')

x = Bidirectional(LSTM(..., return_sequences=True))(input_data)
x = Bidirectional(LSTM(..., return_sequences=True))(x)

y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)

With ALPHABET_LENGTH = 29 (alphabet length + special characters).

And:

  • y_true: tensor (samples, max_string_length) containing the truth labels.
  • y_pred: tensor (samples, time_steps, num_categories) containing the predictions, i.e. the output of the softmax.
  • input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
  • label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.


Now, I'm facing some problems:

  2. What I don't understand
    • Is this implementation the right way to code and use CTC loss?
    • I don't understand what y_true, input_length and label_length concretely are. Any examples?
    • In what form should I give the labels to the network? Again, any examples?

Answer

What are these?

    • y_true: your ground truth data; the data you are going to compare with the model's outputs in training. (On the other hand, y_pred is the model's calculated output.)
    • input_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_pred tensor (as said here).
    • label_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_true (or labels) tensor.
      It seems this loss expects that your model's outputs (y_pred) have varying lengths, as well as your ground truth data (y_true). This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you will need a fixed-size tensor for working with lots of sentences at once).

      Since the function's documentation asks for shape (samples, length), the format is that... the char index for each char in each sentence.
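
      As a concrete example, here is a minimal sketch of building y_true and label_length for a batch of two sentences (the sentences, padded length and char-to-index mapping are illustrative assumptions):

      import string
      import numpy as np

      # assumed mapping: 'a'..'z' -> 0..25, '_' -> 26, '>' -> 27, '-' -> 28
      char_to_index = {ch: i for i, ch in
                       enumerate(list(string.ascii_lowercase) + ['_', '>', '-'])}

      sentences = ['hello_world>', 'hi>']
      max_string_length = 15

      # y_true: (samples, max_string_length) of char indices, zero-padded;
      # ctc_batch_cost ignores the padding thanks to label_length
      y_true = np.zeros((len(sentences), max_string_length))
      for i, s in enumerate(sentences):
          y_true[i, :len(s)] = [char_to_index[c] for c in s]

      # label_length: (samples, 1) holding the real length of each sentence
      label_length = np.array([[len(s)] for s in sentences])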

      There are some possibilities.

      1 - If all lengths are the same, you can easily use it as a regular loss:

      def ctc_loss(y_true, y_pred):
          # input_length and label_length are constants you created previously;
          # the easiest way here is to have a fixed batch size in training,
          # and the lengths must share that batch size (see the shapes in the
          # docs for ctc_batch_cost)
          return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)

      model.compile(loss=ctc_loss, ...)

      # here is how you pass the labels for training
      model.fit(input_data_X_train, ground_truth_data_Y_train, ...)
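
      For instance, assuming the model emits a fixed number of output steps and every label is padded to one common written length (all numbers below are assumptions for illustration), the two constants could be built like this:

      import numpy as np

      batch_size = 32     # fixed batch size, as suggested in the comments above
      pred_steps = 998    # time steps in y_pred
      label_steps = 15    # common length of every label

      input_length = np.full((batch_size, 1), pred_steps)
      label_length = np.full((batch_size, 1), label_steps)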
      

      2 - If you care about the lengths.

      This is a little more complicated: you need your model to somehow tell you the length of each output sentence.
      There are again several creative ways of doing this:

      • Have an "end_of_sentence" char and detect where in the sentence it is.
      • Have a branch of your model that computes this number and rounds it to an integer.
      • (Hardcore) If you are using a stateful manual training loop, get the index of the iteration in which you decided to finish a sentence.

      I like the first idea, and will exemplify it here.

      def ctc_find_eos(y_true, y_pred):
          # assumes eos_index and max_length exist as constants, and that both
          # the labels and the model output are padded to max_length steps

          # convert y_pred from one-hot to label indices
          y_pred_ind = K.argmax(y_pred, axis=-1)

          # force the last step to be eos, to make sure y_pred has at least
          # one end_of_sentence (to avoid errors)
          y_pred_end = K.concatenate([
                                        y_pred_ind[:, :-1],
                                        eos_index * K.ones_like(y_pred_ind[:, -1:])
                                     ], axis=1)

          # strictly decreasing weights [max_length, ..., 2, 1] make the first
          # occurrence of eos more important than subsequent ones
          occurrence_weights = K.arange(start=max_length, stop=0, step=-1,
                                        dtype=K.floatx())

          # is eos?
          is_eos_true = K.cast(K.equal(y_true, eos_index), K.floatx())
          is_eos_pred = K.cast(K.equal(y_pred_end, eos_index), K.floatx())

          # lengths: position of the first eos, plus one
          true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
          pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

          # reshape to (samples, 1), the shape ctc_batch_cost expects
          true_lengths = K.reshape(true_lengths, (-1, 1))
          pred_lengths = K.reshape(pred_lengths, (-1, 1))

          return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

      model.compile(loss=ctc_find_eos, ....)
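
      Note that with this option the lengths are computed inside the loss itself, so the model outputs y_pred directly and the padded label-index matrix is passed as an ordinary target to model.fit; the extra Input layers and the Lambda trick from the question are not needed.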
      

      If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section for lengths:

      def ctc_concatenated_length(y_true, y_pred):
          # assuming you concatenated the length in the first step
          true_lengths = y_true[:, :1]   # may need to cast to int
          y_true = y_true[:, 1:]

          # since y_pred is one-hot, the length was concatenated over the full
          # size of the last axis; read it back from channel 0 of step 0
          pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
          y_pred = y_pred[:, 1:]

          return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
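
      To feed this loss, the ground truth must carry each true length in its first step. A minimal sketch, assuming labels padded as in the earlier example:

      import numpy as np

      # y_labels: (samples, max_string_length) of padded char indices
      # true_lens: (samples,) real label lengths
      y_labels = np.zeros((2, 15))
      true_lens = np.array([12, 3])

      # prepend the length as an extra first step, mirroring the
      # y_true[:, :1] / y_true[:, 1:] split inside the loss above
      y_true_with_len = np.concatenate(
          [true_lens.reshape((-1, 1)), y_labels], axis=1)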
      

