Why are stop words not being excluded from the word cloud when using Python's wordcloud library?

Problem description

I want to exclude 'The', 'They' and 'My' from being displayed in my word cloud. I'm using the Python library 'wordcloud' as shown below and updating the STOPWORDS set with these 3 additional stopwords, but the word cloud still includes them. What do I need to change so that these 3 words are excluded?

The libraries I imported are:

import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

I've tried adding elements to the STOPWORDS set as follows but, even though the words are added successfully, the word cloud still shows the 3 words I added to the STOPWORDS set:

len(STOPWORDS) outputs: 192

Then I ran:

STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')

Then I ran:

len(STOPWORDS) outputs: 195

I'm running Python version 3.7.3.

I know I could amend the text input to remove the 3 words before running the word cloud (rather than trying to amend WordCloud's STOPWORDS set), but I was wondering whether there's a bug in WordCloud or whether I'm not updating/using STOPWORDS correctly.

Recommended answer

By default, WordCloud uses collocations=True, so frequent phrases of two adjacent words are included in the cloud. Importantly for your issue, stopword removal works differently with collocations: for example, "Thank you" is a valid collocation and may appear in the generated cloud even though "you" is in the default stopwords. Only collocations that consist entirely of stopwords are removed.

The not unreasonable-sounding rationale for this is that if stopwords were removed before building the list of collocations, then e.g. "thank you very much" would yield "thank very" as a collocation, which I definitely wouldn't want.
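To make that rationale concrete, here is a minimal pure-Python sketch of the two orderings. It is not wordcloud's actual implementation; the tiny stopword set `STOP` and the helper `bigrams` are hypothetical names chosen for illustration.

```python
# Hypothetical mini stopword set, for illustration only (not wordcloud's list).
STOP = {"you", "much"}


def bigrams(words):
    """Return adjacent word pairs in order."""
    return list(zip(words, words[1:]))


words = "thank you very much".split()

# Option A: drop stopwords first, then pair up what is left.
filtered_first = bigrams([w for w in words if w not in STOP])

# Option B: pair first, then drop only pairs made up entirely of stopwords.
paired_first = [p for p in bigrams(words) if not all(w in STOP for w in p)]

print(filtered_first)  # [('thank', 'very')]
print(paired_first)    # [('thank', 'you'), ('you', 'very'), ('very', 'much')]
```

Option A invents the pair "thank very", which never occurs in the text. Option B keeps real phrases such as "thank you", at the cost of letting a stopword appear inside a phrase.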

So to get your stopwords to work the way you probably expect, i.e. with no stopwords at all appearing in the cloud, you could use collocations=False like this:

my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white', 
    collocations=False, 
    max_words=10).generate(all_tweets_as_one_string)

Update:

  • With collocations=False, the stopwords are all lowercased and compared with the lowercased text when removing them, so there is no need to add 'The' etc.
  • With collocations=True (the default), the stopwords are still lowercased, but the source text is not lowercased when looking for all-stopword collocations to remove. So e.g. 'The' in the text isn't removed while 'the' is. That's why @Balaji Ambresh's code works, and you'll see that there are no capitals in the cloud. This might be a defect in WordCloud, I'm not sure. However, adding e.g. 'The' to the stopwords won't affect this, because the stopwords are always lowercased regardless of whether collocations is True or False.
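The case handling described for collocations=False can be sketched in plain Python. This is an illustration of the idea only, not code taken from wordcloud; `remove_stopwords` is a hypothetical helper.

```python
def remove_stopwords(words, stopwords):
    """Drop every word whose lowercase form is in the lowercased stopword set."""
    lowered = {s.lower() for s in stopwords}
    return [w for w in words if w.lower() not in lowered]


words = ["The", "bear", "sat", "with", "the", "cat"]
base = {"the", "with"}

# Both sides are lowercased, so "The" is already caught by "the".
print(remove_stopwords(words, base))            # ['bear', 'sat', 'cat']

# Adding the capitalised form changes nothing: it lowercases to "the" anyway.
print(remove_stopwords(words, base | {"The"}))  # ['bear', 'sat', 'cat']
```

Under this comparison, a capitalised entry such as 'The' is redundant, which fits the observation that the set grew from 192 to 195 entries without changing the cloud.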

This is all visible in the source code :-)

For example, with the default collocations=True I get:

With collocations=False I get:

Code:

from wordcloud import WordCloud
from matplotlib import pyplot as plt


# Sample text with capitalised and lowercase stopwords ("The"/"the", "They", "My").
text = (
    "The bear sat with the cat. They were good friends. "
    "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat "
    "there enjoying the view. You should have seen it. The view was absolutely lovely. "
    "It was such a lovely day. The bear was loving it too."
)

cloud = WordCloud(collocations=False,
        background_color='white',
        max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
