Using keras tokenizer for new words not in training set


Problem Description

I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However, the problem I have is that this seems to defeat one of the advantages of using a word vector embedding: when using the trained model for predictions, if it runs into a new word that's not in the tokenizer's word index, it removes it from the sequence.

import numpy as np
from keras.preprocessing.text import Tokenizer

# fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

# load the GloVe embeddings into a dict
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data) as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

#create embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector[:dims]

# Embedding layer:
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=12)

# then to make a prediction (in practice the sequence would be padded to input_length first)
sequence = tokenizer.texts_to_sequences(["Test sentence"])
model.predict(sequence)
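As a quick sanity check on the matrix-building loop above, the same logic can be run on toy data to see which vocabulary rows stay all-zero. Both dictionaries below are illustrative stand-ins, not a real tokenizer index or GloVe data:

```python
import numpy as np

# toy stand-ins for word_index and embeddings_index (illustrative only)
word_index = {"hello": 1, "world": 2, "zzxq": 3}
embeddings_index = {"hello": np.ones(4), "world": np.full(4, 2.0)}

dims = 4
embedding_matrix = np.zeros((len(word_index) + 1, dims))
covered = 0
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec[:dims]
        covered += 1

# rows for words missing from the embedding dict stay all-zero
print(f"{covered}/{len(word_index)} words covered")  # → 2/3 words covered
```

Counting coverage like this makes it easy to spot when a large fraction of your vocabulary is falling back to zero vectors.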

So is there a way I can still use the tokenizer to transform sentences into arrays while still drawing on as many of the words in the GloVe dictionary as possible, instead of only the ones that show up in my training text?

Upon further contemplation, I guess one option would be to add a text (or texts) to the texts the tokenizer is fit on that includes a list of the keys in the GloVe dictionary. That might mess with some of the statistics, though, if I want to use tf-idf. Is there a preferable way to do this, or a different, better approach?
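The "fit on GloVe keys" workaround can be sketched without Keras. Here a `Counter` stands in for the word counts that `Tokenizer.fit_on_texts` accumulates, and the vocabulary list is illustrative:

```python
from collections import Counter

texts = ["the quick brown fox"]
glove_vocab = ["the", "quick", "brown", "fox", "jumps", "lazy", "dog"]  # illustrative

# append one pseudo-document containing every GloVe key before fitting
augmented_texts = texts + [" ".join(glove_vocab)]

counts = Counter(w for t in augmented_texts for w in t.split())
print(counts["jumps"])  # "jumps" now appears once, so the tokenizer would index it
print(counts["the"])    # but "the" is inflated from 1 to 2, which skews tf-idf
```

This shows both sides of the trade-off the question raises: every GloVe word becomes indexable, but every key also gains one spurious occurrence in the fitted counts.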

Recommended Answer

The Keras Tokenizer has an oov_token parameter. Pick a token, and any unknown word will be mapped to it.

from keras.preprocessing.text import Tokenizer

tokenizer_a = Tokenizer(oov_token=1)  # newer Keras versions expect a string here, e.g. oov_token='<OOV>'
tokenizer_b = Tokenizer()
tokenizer_a.fit_on_texts(["Hello world"])
tokenizer_b.fit_on_texts(["Hello world"])

Output

In [26]: tokenizer_a.texts_to_sequences(["Hello cruel world"])
Out[26]: [[2, 1, 3]]

In [27]: tokenizer_b.texts_to_sequences(["Hello cruel world"])
Out[27]: [[1, 2]]
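To see what the oov_token mapping buys you end to end, here is a minimal Keras-free sketch; the `word_index` dict and `texts_to_sequences` helper are illustrative stand-ins for the Tokenizer's behavior, with index 1 reserved for out-of-vocabulary words and index 0 for padding:

```python
# stand-in for a fitted tokenizer's word index (illustrative only)
word_index = {"<OOV>": 1, "hello": 2, "world": 3}  # index 0 reserved for padding

def texts_to_sequences(texts, word_index, oov_index=1):
    """Map each word to its index, falling back to the OOV index instead of dropping it."""
    return [[word_index.get(w.lower(), oov_index) for w in t.split()] for t in texts]

print(texts_to_sequences(["Hello cruel world"], word_index))  # → [[2, 1, 3]]
```

Because unknown words map to a fixed index rather than being dropped, you can give that index its own row in the embedding matrix (zeros, or the mean of all GloVe vectors), and sequence lengths stay consistent with the input.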
