Gensim word2vec on predefined dictionary and word-indices data


Problem description

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen for gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (including an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indices into this dictionary. A simple example of the data format can be seen below:

dict.txt

you
love
this
code

tweets (5 = unknown, 6 = EOL)

[[0, 1, 2, 3, 6],
 [3, 5, 5, 1, 6],
 [0, 1, 3, 6, 6]]

I'm unsure how I should handle the index representation. An easy way is just to convert each list of indices to a list of strings (i.e. [0, 1, 2, 3, 6] -> ['0', '1', '2', '3', '6']) as I read it into the word2vec model. However, this seems inefficient, as gensim will then try to look up its internal index for, e.g., '2'.

How do I load this data and create the word2vec representation in an efficient manner using gensim?

Answer

The normal way to initialize a Word2Vec model in gensim is [1]:

from gensim.models import Word2Vec

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

The question is, what is sentences? sentences is supposed to be an iterable of iterables of words/tokens. It is just like the numpy matrix you have, except that each row can have a different length.
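
For concreteness, here is a minimal sketch of what such an iterable could look like, built from the example matrix in the question (the variable names are just for illustration):

import numpy as np

# Example index matrix from the question; 5 = unknown, 6 = EOL.
tweets = np.array([[0, 1, 2, 3, 6],
                   [3, 5, 5, 1, 6],
                   [0, 1, 3, 6, 6]])

# Each row becomes one "sentence": a list of string tokens.
sentences = [[str(idx) for idx in row] for row in tweets]
print(sentences[0])  # ['0', '1', '2', '3', '6']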

If you look at the documentation for gensim.models.word2vec.LineSentence, it gives you a way of loading a text file as sentences directly. As a hint, according to the documentation, it expects:

one sentence = one line; words already preprocessed and separated by whitespace.
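
A sketch of that route, assuming you first dump the index matrix to a hypothetical file tweets.txt, one tweet per line with indices separated by spaces (e.g. "0 1 2 3 6"):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams the file lazily, splitting each line on whitespace.
# (Per the note below, you'd filter the 5/6 special indices before writing the file.)
sentences = LineSentence('tweets.txt')  # hypothetical dump of the index matrix

# min_count=1 so none of the predefined indices are discarded.
# (Newer gensim releases rename the size parameter to vector_size.)
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)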

When it says words already preprocessed, it is referring to lower-casing, stemming, stopword filtering, and all the other text-cleansing steps. In your case you wouldn't want 5 and 6 in your list of sentences, so you do need to filter them out.

Given that you already have the numpy matrix, and assuming each row is a sentence, it is better to cast it into a 2d array and filter out all 5s and 6s. The resulting 2d array can be used directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you need to input the indices instead of the tokens.
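
A minimal sketch of that filtering approach, reusing the tweets matrix from the earlier snippet, with 5 and 6 as the special indices from the question:

from gensim.models import Word2Vec

SPECIAL = {5, 6}  # unknown and EOL indices
sentences = [[str(idx) for idx in row if idx not in SPECIAL]
             for row in tweets]

model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
# Queries go through the index strings, not the original words:
print(model.wv.most_similar('1'))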

Now one question is whether the model takes integers directly. The pure-Python code path doesn't check for type and just passes the unique tokens around, so your unique indices would work fine there. But most of the time you will want the C-extended routines to train your model, which matters because they can be about 70x faster. [2] I imagine the C code may check for string type in that case, which would mean a string-to-index mapping is stored.

Is this inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indices starting from 0). The models will therefore be smaller, which saves some effort when serializing and deserializing them at the end. You have essentially encoded the input tokens in a shorter string format, separated that encoding from the word2vec training, and the word2vec model does not and need not know it happened before training.
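
To get from index strings back to words after training, a small decoding step suffices. This sketch assumes dict.txt holds one word per line in index order, as in the question:

# Build an index -> word lookup from the dictionary file.
with open('dict.txt') as f:
    index_to_word = [line.strip() for line in f]

# The model returns index strings; map them back to words.
for index_str, score in model.wv.most_similar('1'):
    print(index_to_word[int(index_str)], score)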

My philosophy is to try the simplest way first. I would just throw a sample test input of integers at the model and see what goes wrong. Hope it helps.
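
A quick version of that experiment, feeding raw integers and falling back to strings if the installed gensim version rejects them (whether it works is exactly what the test probes):

from gensim.models import Word2Vec

int_sentences = [[0, 1, 2, 3], [3, 1], [0, 1, 3]]  # integer tokens, not strings
try:
    model = Word2Vec(int_sentences, size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar(1))
except Exception as exc:
    print('Integers rejected; convert tokens to strings instead:', exc)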

[1] https://radimrehurek.com/gensim/models/word2vec.html

[2] http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
