Doc2vec: model.docvecs is only of length 10
Question
I am trying doc2vec on 600,000 rows of sentences, and my code is below:
import multiprocessing
import gensim

cores = multiprocessing.cpu_count()

model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
model.build_vocab(res)  # len(res) = 663406
model.train(res, total_examples=model.corpus_count, epochs=model.iter)

print(len(model.wv.vocab))  # number of unique words: 15581
len(model.docvecs)          # length of doc vectors is 10
len(model.docvecs[1])       # each of length 100
How do I interpret this result? Why is the length of model.docvecs only 10, with each vector of size 100? When the length of res is 663406, that does not make sense. I know something is wrong here.
In "Understanding the output of Doc2Vec from Gensim package", they mention that the length of a docvec is determined by 'size', which is not clear to me.
Answer
The tags of a TaggedDocument should be a list-of-tags. If you instead provided a string, like tags='73215', that would be treated the same as the list-of-characters:

tags=['7', '3', '2', '1', '5']
At the end, you'd only have 10 tags in your whole training set: just the 10 digits, in various combinations.
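A minimal sketch of the difference, using a namedtuple stand-in for TaggedDocument (which behaves like a namedtuple of words and tags) so it runs without gensim installed; the corpus here is hypothetical:

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument.
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

bad = TaggedDoc(words=["the", "cat", "sat"], tags='73215')      # string tag
good = TaggedDoc(words=["the", "cat", "sat"], tags=['73215'])   # list-of-tags

# Training iterates over doc.tags, so a string is seen character by character:
print([t for t in bad.tags])   # ['7', '3', '2', '1', '5']
print([t for t in good.tags])  # ['73215']

# Over a whole corpus of string tags, only the 10 digit characters survive:
corpus = [TaggedDoc(words=["w"], tags=str(i)) for i in range(600000)]
unique_tags = {t for doc in corpus for t in doc.tags}
print(len(unique_tags))  # 10
```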
That your len(model.docvecs[1]) is 100 means you didn't make exactly this error, but perhaps something similar, in constructing your TaggedDocument training data.
Look at the first item in res to see if its tags property makes sense, and at each of the model.docvecs, to see what's being used instead of what you intended.
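A check along these lines can catch the problem before training; this is a sketch using the same namedtuple stand-in as above, with a tiny hypothetical corpus standing in for the real res:

```python
from collections import namedtuple

TaggedDoc = namedtuple('TaggedDoc', 'words tags')

# Hypothetical correctly-built corpus: each document tagged with its own index.
res = [TaggedDoc(words=["some", "tokens"], tags=[str(i)]) for i in range(3)]

first = res[0]
# tags must be a list (or other non-string sequence), never a bare string:
assert isinstance(first.tags, list), "tags should be a list, not a string"

# With one unique tag per document, the tag count should equal len(res);
# a much smaller number (like 10) signals string tags being split into digits.
unique_tags = {t for doc in res for t in doc.tags}
print(len(unique_tags))  # 3, matching len(res)
```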