在Python上的WordCloud中,我想合并两种语言 [英] In WordCloud on Python I would like to merge two languages

查看:336
本文介绍了在Python上的WordCloud中,我想合并两种语言的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Python上的WordCloud中,我想将两种语言合并为一张图片(英语,阿拉伯语),但是当您看到一个正方形而不是单词时,并且当我调用Arabic_reshaper库并进行创建时,我无法添加阿拉伯语它读取了csv文件,它向我展示了阿拉伯语,并将英语变成了正方形

In WordCloud on Python I would like to merge two languages ​​into one picture (English, Arabic) but I was unable to add the Arabic language as you see a squares instead of words and when I call the Arabic_reshaper library and make it read the csv file It shows me the Arabic language and make the English language as a squares

    wordcloud = WordCloud(
                          collocations = False,
                          width=1600, height=800,
                          background_color='white',
                          stopwords=stopwords,
                          max_words=150,
                          random_state=42,
                          #font_path='/Users/mac/b.TTF'
                         ).generate(' '.join(df['body_new']))
print(wordcloud)
plt.figure(figsize=(9,8))
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

给她看了两种语言,但看到了正方形而不是阿拉伯语

不会喜欢这种最多两种语言

推荐答案

一段时间以来,我一直在同一个问题上挣扎,而解决它的最佳方法是generate_from_frequencies()函数.您还需要正确的阿拉伯字体. "Shorooq"可以正常运行,并且可以免费在线获得.这是您的代码的快速修复:

I've been struggling with the same problem for a while now and the best way to deal with it is the generate_from_frequencies() function. You also need a proper font for Arabic. 'Shorooq' will work fine and available online for free. Here is a quick fix to your code:

from arabic_reshaper import arabic_reshaper
from bidi.algorithm import get_display
from nltk.corpus import stopwords
from itertools import islice


text = " ".join(line for lines in df['body_new'])
stop_ar = stopwords.words('arabic') 
# add more stop words here like numbers, special characters, etc. It should be customized for your project

top_words = {}
words = text.split()
for w in words:
    if w in stop_ar:
        continue
    else:
        if w not in top_words:
            top_words[w] = 1
        else:
            top_words[w] +=1

# Sort the dictionary of the most frequent words
top_words = {k: v for k, v in sorted(top_words.items(), key=lambda item: item[1], reverse = True)}

# select the first 150 most frequent words
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
for_wc = take(150, top_words.items())

# you need to reshape your words to be shown properly and turn the result into a dictionary
dic_data = {}
for t in for_wc:
    r = arabic_reshaper.reshape(t[0]) # connect Arabic letters
    bdt = get_display(r) # right to left
    dic_data[bdt] = t[1] 

# Plot
wc = WordCloud(background_color="white", width=1600, height=800,max_words=400, font_path='fonts/Shoroq.ttf').generate_from_frequencies(dic_data)
plt.figure(figsize=(16,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

重要提示:

get_display()reshape()可能会给您错误.这是因为您的文本中有一个奇怪的字符,这些功能无法处理.但是,找到它并不难,因为您只需要在绘图中使用150个单词即可显示.找到它并将其添加到您的停用词,然后重新运行代码.

get_display() or reshape() might give you error. It is because there is a weird character in your text that these functions are unable to deal with. However finding it should not be so difficult as you only use 150 words to display in your plot. Find it and add it to your Stop Words and rerun the code.

这篇关于在Python上的WordCloud中,我想合并两种语言的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆