How to build a language model using an LSTM that assigns a probability of occurrence to a given sentence


Question

Currently, I am using a trigram model to do this. It assigns a probability of occurrence to a given sentence, but it is limited to a context of only two words. LSTMs can use longer contexts, so how can I build an LSTM model that assigns a probability of occurrence to a given sentence?
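For reference, a trigram model caps the context at the two preceding words: it scores a sentence as a product of P(w_i | w_{i-2}, w_{i-1}) terms estimated from counts. A minimal sketch of that idea (the count tables and names are hypothetical, and no smoothing or start-of-sentence handling is applied):

from collections import Counter

trigram_counts = Counter()  # (w1, w2, w3) -> count, built from some corpus
bigram_counts = Counter()   # (w1, w2) -> count, built from the same corpus

def trigram_sentence_prob(words):
    # Product of P(w_i | w_{i-2}, w_{i-1}); the context never exceeds 2 words.
    # Assumes every context was seen in training (no smoothing).
    prob = 1.0
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        prob *= trigram_counts[context + (words[i],)] / bigram_counts[context]
    return prob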

Answer

I have just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.

Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]

First of all, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(data)
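Tokenizer assigns integer indices starting at 1, ordered by word frequency; index 0 is left unused, which is what pad_sequences will use for padding below. The output should look something like this (the exact indices depend on the fit, so these values are illustrative only):

print(vocab)    # e.g. {'peter': 1, 'paul': 2, 'one': 3, 'called': 4, ...}
print(seqs[2])  # e.g. [3, 4, 1], the encoding of "One called Peter,"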

Our model will take a sequence of words as input (the context) and will output the conditional probability distribution over every word in the vocabulary given that context. To this end, we prepare the training data by padding the sequences and sliding windows over them:

def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding
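To make the windowing concrete, here is what prepare_sentence produces for a single verse (hypothetical token indices; with the verses above maxlen is 4, so every context is left-padded to length 3):

x_demo, y_demo = prepare_sentence([3, 4, 1], maxlen=4)
# x_demo -> [array([0, 0, 0]), array([0, 0, 3]), array([0, 3, 4])]
# y_demo -> [3, 4, 1]
# i.e. one training pair per word: P(w1), P(w2 | w1), P(w3 | w1, w2)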

I decided to slide windows separately for each verse, but this could be done differently.

Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):

# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                               # extra element for <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')

# Train network
model.fit(x, y, epochs=1000)

The joint probability of occurrence P(w_1, ..., w_n) of a sentence w_1 ... w_n can be computed using the chain rule of probability:

P(w_1, ..., w_n) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1}, ..., w_1)

where each of these conditional probabilities is given by the LSTM model. Since they might be very small, it is sensible to work in log space to avoid numerical instability issues.
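In log space the product above becomes a sum of log conditional probabilities:

log P(w_1, ..., w_n) = log P(w_1) + log P(w_2|w_1) + ... + log P(w_n|w_{n-1}, ..., w_1)

Putting it all together: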

# Compute probability of occurrence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
p_pred = model.predict(x_test)  # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}

# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i]+1]  # Index 0 from vocab is reserved to <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))
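As a usage note, the scoring steps above can be wrapped in a small helper (not part of the original answer; the name is illustrative) to compare sentences, e.g. an in-corpus verse against a reordered one. All words must already be in the training vocabulary, since texts_to_sequences silently drops unknown words:

def sentence_log_prob(sentence):
    # Tokenize, build padded context windows, and sum log conditional probs
    tok = tokenizer.texts_to_sequences([sentence])[0]
    x_s, y_s = prepare_sentence(tok, maxlen)
    x_s = np.array(x_s)
    y_s = np.array(y_s) - 1  # Shift indices: 0 is reserved for <PAD>
    probs = model.predict(x_s)
    return np.sum(np.log(probs[np.arange(len(y_s)), y_s]))

print(sentence_log_prob("One called Peter,"))  # word order seen in training
print(sentence_log_prob("Peter called one,"))  # unseen order, typically lower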

NOTE: This is a very small toy dataset and we might be overfitting.
