Creating an n-gram word cloud using Python


Problem description

I am trying to generate a word cloud from bi-grams. I can extract the top 30 discriminative bi-grams for each category, but when I plot them the two words of each bi-gram are not kept together, so the image still looks like a uni-gram cloud. I used the following script with the scikit-learn and wordcloud packages.

import numpy
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def create_wordcloud(pipeline):
    """
    Create word cloud with top 30 discriminative words for each category
    """
    class_labels = numpy.array(['Arts', 'Music', 'News', 'Politics', 'Science', 'Sports', 'Technology'])

    # On scikit-learn 1.0 and later, use get_feature_names_out() instead
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names()
    word_text = []

    for i, class_label in enumerate(class_labels):
        # Indices of the 30 largest coefficients for this class
        top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

        print("%s: %s" % (class_label, " ".join(feature_names[j] + "," for j in top30)))

        for j in top30:
            word_text.append(feature_names[j])
        #print(word_text)
        wordcloud1 = WordCloud(width=800, height=500, margin=10, random_state=3, collocations=True).generate(' '.join(word_text))

        # Save word cloud as .png file
        # Image files are saved to the folder "classification_model"
        wordcloud1.to_file(class_label + "_wordcloud.png")

        # Plot wordcloud on console
        plt.figure(figsize=(15, 8))
        plt.imshow(wordcloud1, interpolation="bilinear")
        plt.axis("off")
        plt.show()
        # Reset the list before processing the next category
        word_text = []

This is my pipeline code:

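from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # SVM using TfidfVectorizer; stop_words1 is defined elsewhere (not shown)
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(2, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])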

These are some of the features I got for the category "Arts":

Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors

Recommended answer

I think you need to join the two words of each n-gram in feature_names with some symbol other than a space. I propose an underscore, for example. As it stands, I think this part turns your n-grams back into separate words, because the word cloud re-tokenizes the joined string on whitespace:

' '.join(word_text)
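
The effect can be reproduced without any plotting library. The following is a minimal sketch using only the standard library; the token pattern and the helper name `tokenize` are assumptions for illustration, chosen to behave like a typical word-character tokenizer, and are not taken from the original post.

```python
import re

# Assumed token pattern for illustration: a run of word characters.
# Note that "\w" also matches the underscore.
TOKEN_PATTERN = re.compile(r"\w[\w']+")


def tokenize(text):
    """Split text into tokens the way a simple word tokenizer would."""
    return TOKEN_PATTERN.findall(text)


bigrams = ["film actors", "television series"]

# Joining with spaces makes the bi-gram boundaries indistinguishable
# from the spaces inside each bi-gram.
spaced = " ".join(bigrams)

# Replacing the inner space with an underscore keeps each bi-gram whole.
underscored = " ".join(b.replace(" ", "_") for b in bigrams)

print(tokenize(spaced))
print(tokenize(underscored))
```

In the space-joined string the tokenizer has no way to tell where one bi-gram ends and the next begins, which is why the resulting cloud looks like a uni-gram cloud.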

I think you have to substitute the space with an underscore here:

word_text.append(feature_names[j])

Change it to this:

word_text.append(feature_names[j].replace(' ', '_'))
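
As a possible alternative that is not part of the original answer, the n-grams can be passed to the word cloud as a mapping from phrase to weight instead of as one joined string, so the text is never re-tokenized. The sketch below only builds that mapping; the helper name `top_ngram_weights` and the sample feature names and coefficients are hypothetical and stand in for `feature_names` and `pipeline.named_steps['clf'].coef_[i]`.

```python
def top_ngram_weights(feature_names, coefficients, k=30):
    """Return {ngram: weight} for the k largest positive coefficients."""
    ranked = sorted(
        zip(feature_names, coefficients),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep only positive weights, since the weights are meant to act
    # as frequencies when sizing the words.
    return {name: float(weight) for name, weight in ranked[:k] if weight > 0}


# Hypothetical sample data for illustration only.
features = ["film actors", "set index", "television series", "stand comedian"]
coefs = [0.9, -0.2, 0.7, 0.4]

weights = top_ngram_weights(features, coefs, k=3)
print(weights)
```

With the wordcloud package installed, a mapping like this could be handed to `WordCloud.generate_from_frequencies(weights)` in place of `generate(' '.join(word_text))`, which should render each phrase with its space intact and size it by its coefficient rather than by how often it appears in the joined text.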
