In WordCloud on Python I would like to merge two languages


Problem description


In WordCloud on Python I would like to merge two languages (English and Arabic) into one picture, but I was unable to add the Arabic: squares appear instead of words. When I use the arabic_reshaper library and have it read the CSV file, the Arabic is displayed correctly but the English turns into squares.

wordcloud = WordCloud(
    collocations=False,
    width=1600, height=800,
    background_color='white',
    stopwords=stopwords,
    max_words=150,
    random_state=42,
    # font_path='/Users/mac/b.TTF'
).generate(' '.join(df['body_new']))
print(wordcloud)
plt.figure(figsize=(9,8))
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

See here: I put in two languages, but squares appear instead of the Arabic words.

I want something like this, a mix of the two languages.

Solution

I've been struggling with the same problem for a while, and the best way I have found to deal with it is the generate_from_frequencies() function. You also need a font that supports Arabic. 'Shorooq' works fine and is available online for free. Here is a quick fix to your code:

import arabic_reshaper
from bidi.algorithm import get_display
from nltk.corpus import stopwords
from itertools import islice
from wordcloud import WordCloud
import matplotlib.pyplot as plt


text = " ".join(line for line in df['body_new'])
stop_ar = stopwords.words('arabic') 
# add more stop words here like numbers, special characters, etc. It should be customized for your project

top_words = {}
words = text.split()
for w in words:
    if w in stop_ar:
        continue
    else:
        if w not in top_words:
            top_words[w] = 1
        else:
            top_words[w] +=1

# Sort the dictionary of the most frequent words
top_words = {k: v for k, v in sorted(top_words.items(), key=lambda item: item[1], reverse = True)}

# select the first 150 most frequent words
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
for_wc = take(150, top_words.items())

# you need to reshape your words to be shown properly and turn the result into a dictionary
dic_data = {}
for t in for_wc:
    r = arabic_reshaper.reshape(t[0]) # connect Arabic letters
    bdt = get_display(r) # right to left
    dic_data[bdt] = t[1] 

# Plot
wc = WordCloud(background_color="white", width=1600, height=800,max_words=400, font_path='fonts/Shoroq.ttf').generate_from_frequencies(dic_data)
plt.figure(figsize=(16,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
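
As a side note, the counting, sorting, and "take the first 150" steps above can also be written with collections.Counter from the standard library. The following is only a minimal sketch of that idea, not part of the original answer: the function name top_frequencies and the sample text are made up for illustration, and the stop-word list here is a one-item placeholder.

```python
from collections import Counter


def top_frequencies(text, stop_words, n=150):
    """Return the n most frequent non-stop words as a {word: count} dict."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    counts = Counter(w for w in text.split() if w not in stop_set)
    # most_common(n) returns (word, count) pairs sorted by descending count
    return dict(counts.most_common(n))


# Made-up sample mixing English and Arabic tokens (hypothetical data)
sample = "python سحابة python كلمات سحابة python في"
freqs = top_frequencies(sample, stop_words=["في"], n=2)
print(freqs)
```

The resulting dictionary has the same shape as top_words above, so each key would still need to go through reshape() and get_display() before being passed to generate_from_frequencies().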

Important:

get_display() or reshape() might raise an error. This happens when your text contains an unusual character that these functions cannot handle. Finding it should not be difficult, since only 150 words are displayed in the plot. Find the offending word, add it to your stop words, and rerun the code.
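
One way to locate such a word is to run the transformation on each candidate inside a try/except and collect the failures. This is a rough sketch under assumptions: find_unprocessable and demo_transform are hypothetical names, and demo_transform is only a stand-in that rejects digits so the example runs without the Arabic libraries installed.

```python
def find_unprocessable(words, transform):
    """Return (word, error message) pairs for words on which transform raises."""
    failures = []
    for word in words:
        try:
            transform(word)
        except Exception as exc:  # broad on purpose: we only want to locate the culprit
            failures.append((word, str(exc)))
    return failures


# Stand-in transform for demonstration only; it rejects any word containing a digit.
def demo_transform(word):
    if any(ch.isdigit() for ch in word):
        raise ValueError("unsupported character")
    return word


bad = find_unprocessable(["سحابة", "abc1", "python"], demo_transform)
print(bad)
```

With the code above, the transform would presumably be something like lambda w: get_display(arabic_reshaper.reshape(w)), applied to [t[0] for t in for_wc]; any word it reports can then be appended to stop_ar.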
