How to preserve the number of records in word2vec?
Question
I have 45,000 text records in my dataframe. I want to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.
After training a word2vec model with 300 features, the model ended up with only 26,000 vectors. How can I preserve all of my 45,000 records?
In the classifier model, I need all of those 45,000 records so that they can match the 45,000 output labels.
Answer
If you are splitting each entry into a list of words, that's essentially 'tokenization'.
Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve', no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 vectors at the end.
Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
If you only have word-vectors, one simplistic way to create a vector for a larger text is to just add all the individual word vectors together. Further options include choosing between using the unit-normed word-vectors or raw word-vectors of many magnitudes; whether to then unit-norm the sum; and whether to otherwise weight the words by any other importance factor (such as TF/IDF).
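The summing strategy can be sketched like this (the toy word vectors stand in for lookups against a trained `model.wv`; the helper name and flags are illustrative, not a gensim API):

```python
import numpy as np

# Toy word vectors standing in for a trained model's model.wv lookups.
word_vecs = {
    "cat": np.array([1.0, 0.0, 2.0]),
    "dog": np.array([0.0, 3.0, 1.0]),
}

def text_vector(words, vecs, unit_norm_words=False, unit_norm_sum=True):
    """Sum the vectors of known words, with optional unit-normalization."""
    found = [vecs[w] for w in words if w in vecs]  # skip out-of-vocab words
    if not found:
        return np.zeros(len(next(iter(vecs.values()))))
    if unit_norm_words:
        found = [v / np.linalg.norm(v) for v in found]
    total = np.sum(found, axis=0)
    if unit_norm_sum:
        norm = np.linalg.norm(total)
        if norm > 0:
            total = total / norm
    return total

v = text_vector(["cat", "dog", "unknown"], word_vecs)
print(np.linalg.norm(v))  # ~1.0 after unit-norming the sum
```

A TF/IDF weighting would multiply each word's vector by its weight before the sum; the same function shape applies.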
Note that unless your documents are very long, this is a quite small training set for either Word2Vec or Doc2Vec.