Doc2vec: model.docvecs is only of length 10


Question

I am trying doc2vec on 600,000 rows of sentences, and my code is below:

import gensim

# `res` is the training corpus (assumed to be TaggedDocument objects); `cores` is the worker count.
model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
model.build_vocab(res)
model.train(res, total_examples=model.corpus_count, epochs=model.iter)

# len(res) == 663406

# number of unique words: 15581
print(len(model.wv.vocab))

# number of doc vectors: 10
print(len(model.docvecs))

# each doc vector has length 100
print(len(model.docvecs[1]))

How do I interpret this result? Why are there only 10 doc vectors, each of size 100? Since len(res) is 663406, I expected one vector per document, so this does not make sense. I know something is wrong here.

In "Understanding the output of Doc2Vec from Gensim package", they mention that the length of a docvec is determined by 'size', which is not clear to me.

Recommended answer

The tags of a TaggedDocument should be a list of tags. If you instead provide a plain string, like tags='73215', it is treated as if it were a list of characters:

tags=['7', '3', '2', '1', '5']

In the end, you'd have only 10 tags in your whole training set: just the 10 digits, in various combinations.
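The effect described above can be reproduced without training a model. The following is a minimal sketch, not gensim's actual code: it uses a namedtuple as a stand-in for TaggedDocument (an assumption, so the snippet runs without gensim installed), and the corpus and the helper unique_tags are hypothetical names made up for illustration.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same two fields,
# `words` and `tags`), so this sketch runs without gensim.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def unique_tags(corpus):
    """Collect every distinct tag seen when looping over each doc's tags."""
    seen = set()
    for doc in corpus:
        for tag in doc.tags:  # iterating a str yields single characters
            seen.add(tag)
    return seen


# Hypothetical corpus: 1000 tiny "sentences".
sentences = [["some", "words"] for _ in range(1000)]

# Wrong: tags is a bare string such as '731', read as ['7', '3', '1'].
wrong = [TaggedDocument(words=w, tags=str(i)) for i, w in enumerate(sentences)]

# Right: tags is a list holding one string tag per document.
right = [TaggedDocument(words=w, tags=[str(i)]) for i, w in enumerate(sentences)]

print(len(unique_tags(wrong)))  # 10, only the digit characters
print(len(unique_tags(right)))  # 1000, one tag per document
```

Wrapping each tag in a list is the only difference between the two corpora, yet it changes the number of distinct tags from 10 to one per document.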

That your len(model.docvecs[1]) is 100 means you didn't make exactly this error, but perhaps made a similar one when constructing your TaggedDocument training data.

Look at the first item in res to see whether its tags property makes sense, and at each of the entries in model.docvecs to see what is being used instead of what you intended.
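One way to carry out this check before training is to summarize the tags across the corpus. This is only a sketch under assumptions: the namedtuple again stands in for gensim's TaggedDocument, summarize_tags is a hypothetical helper, and the example corpus (which reuses one of 10 class labels as the tag) is just one guess at the kind of "similar" mistake that would also yield 10 doc vectors.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument (assumption: fields `words`, `tags`).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def summarize_tags(corpus):
    """Return (number of distinct tags, number of docs whose tags is a bare str)."""
    counts = Counter()
    bare_strings = 0
    for doc in corpus:
        if isinstance(doc.tags, str):
            bare_strings += 1
        counts.update(doc.tags)  # a str contributes its individual characters
    return len(counts), bare_strings


# Hypothetical corpus: each document is tagged with one of 10 class labels
# rather than a unique document id.
res_example = [
    TaggedDocument(words=["some", "words"], tags=[i % 10]) for i in range(500)
]

print(res_example[0])  # inspect the first item, as suggested above
distinct, bare = summarize_tags(res_example)
print(distinct, bare)  # 10 distinct tags, 0 bare-string tags
```

If the number of distinct tags is far below len(res), the tags are not unique per document, whatever the exact cause turns out to be.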
