Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

Problem description

I am learning the Doc2Vec model from the gensim library and using it as follows:

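import os

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from nltk.corpus import stopwords

class MyTaggedDocument(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                print(fname)  # printed once per file on every pass over the corpus
                for item_no, sentence in enumerate(fin):
                    # NOTE: as written, this keeps only the stopwords; `not in` may have been intended
                    words = [w for w in sentence.lower().split() if w in stopwords.words('english')]
                    # one tag per line: "<file name without extension>_<line number>"
                    tags = [fname.split('.')[0].strip() + '_%s' % item_no]
                    yield LabeledSentence(words, tags)

sentences = MyTaggedDocument(dirname)
model = Doc2Vec(sentences, min_count=2, window=10, size=300, sample=1e-4, negative=5, workers=7)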

The input dirname is a directory path which, for the sake of simplicity, contains only 2 files, each with more than 100 lines. I am getting the exception from the title (shown as a screenshot in the original post): AttributeError: 'str' object has no attribute 'words'.

Also, from the print statement I can see that the iterator went over the directory 6 times. Why is that?

Any help would be appreciated.

Recommended answer

It looks like one of the text-example objects, which should be shaped like a TaggedDocument (with words and tags properties; the class was formerly called LabeledSentence), is somehow a plain string instead. Are you 100% certain that the error in your screenshot was generated by exactly the iterable code you've included? (The code here looks like it could only emit acceptable LabeledSentence objects.)
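One way to test that suspicion is to scan the corpus for items that lack the expected attributes before handing it to Doc2Vec. The following is only a minimal sketch: the namedtuple is a stand-in for gensim's TaggedDocument (assumed to carry the same two fields, words and tags), and find_bad_items is a hypothetical helper, not part of gensim.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same two fields, words and tags).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def find_bad_items(corpus, limit=5):
    """Return up to `limit` (index, item) pairs that lack .words or .tags."""
    bad = []
    for index, item in enumerate(corpus):
        if not (hasattr(item, "words") and hasattr(item, "tags")):
            bad.append((index, item))
            if len(bad) >= limit:
                break
    return bad


good = [TaggedDocument(["hello", "world"], ["doc_0"])]
mixed = good + ["a plain string slips through"]

print(find_bad_items(good))   # nothing wrong with this corpus
print(find_bad_items(mixed))  # reports the plain string and its position
```

Running the same check over the real iterable (for example find_bad_items(sentences)) should show whether any plain strings are being yielded, and at which position.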

Your supplied corpus iterable is read once for an initial scan that discovers all words and tags, then several more times for training. The number of training passes is controlled by the iter parameter, which defaults to 5 (in recent versions of gensim). So the initial scan plus 5 training passes equals 6 total iterations. (10 or more iterations is common with Doc2Vec.)
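The pass count can be illustrated without gensim at all. The sketch below is a simulation under stated assumptions: CountingCorpus and simulate_training are hypothetical names, and the loop only mimics the "one vocabulary scan plus N training passes" pattern described above; it does not train anything. (In later gensim releases the iter parameter is named epochs.)

```python
class CountingCorpus:
    """Restartable iterable that counts how many times it is iterated."""

    def __init__(self, docs):
        self.docs = docs
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.docs)


def simulate_training(corpus, epochs=5):
    """Mimic the access pattern: one vocabulary scan, then `epochs` passes."""
    vocab = set()
    for doc in corpus:            # initial scan to discover the vocabulary
        vocab.update(doc)
    for _ in range(epochs):       # training passes (no real training here)
        for doc in corpus:
            pass
    return vocab


corpus = CountingCorpus([["a", "b"], ["b", "c"]])
simulate_training(corpus, epochs=5)
print(corpus.passes)  # 6: one scan plus five training passes
```

This is also why the corpus has to be a restartable iterable (an object with __iter__, as in the question) rather than a one-shot generator, which would be exhausted after the first pass.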
