Text language detection in Python


Question

I am trying to detect the language of texts that may be written in any of an unknown number of languages. Calling detect on each review separately gives me a different language for each one, as expected. Note: I shortened the reviews because the full text caused an error when posting.

from langdetect import detect

print(detect("كانت جميله وممتعة للأطفال اولا حيث اماكن اللعبر"))
print(detect("的马来西亚"))
print(detect("Vi havde 2 perfekte dage i Legoland Malaysia"))
print(detect("Wij hebben alleen gekozen voor het waterpark maar daar ben je vrijs snel doorheen. Super leuke glijbanen en overal ruimte om te zitten en te liggen. Misschien volgende keer een gecombineerd ticket kopen met ook toegang tot waterpark"))
print(detect("This is a park thats just ok, nothing great to write home about.  There is barely any shade, the weather is always really hot so they need to take this into consideration. The atractions are just meh. I would only go if you are a fan of lego, for the sculptures are nice."))

This is the output:

ar
zh-cn
da
nl
en

But with the following loop, every review gets 'en' as the result:

from langdetect import detect
import pandas as pd
df = pd.read_excel('data.xls') #
lang = []    
for r in df.Review:
    lang = detect(r)
    df['Languagereveiw'] = lang

The output is 'en' for all five rows.

I need guidance: what am I missing?

Here is the sample data (not reproduced here).

Secondly, how can I get the full name of a language, e.g. English for 'en'?

Recommended answer

In your loop you are overwriting the entire column on every iteration with this line, so after the loop finishes every row holds the language of the last review:

df['Languagereveiw'] = lang
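
To make the overwrite visible, here is a minimal sketch that records the column after each pass of the loop. The `fake_detect` function and the `SAMPLE_LANGS` table are hypothetical stand-ins for `langdetect.detect`, used only so the example runs without langdetect installed; the review texts are shortened versions of the ones in the question.

```python
import pandas as pd

# Hypothetical lookup table standing in for real language detection.
SAMPLE_LANGS = {
    "的马来西亚": "zh-cn",
    "Vi havde 2 perfekte dage i Legoland Malaysia": "da",
    "This is a park thats just ok": "en",
}

def fake_detect(text):
    """Stand-in for langdetect.detect (assumption: swap in the real one)."""
    return SAMPLE_LANGS[text]

df = pd.DataFrame({"Review": list(SAMPLE_LANGS)})

history = []
for r in df["Review"]:
    # Assigning a single string to a column broadcasts it to every row.
    df["Languagereveiw"] = fake_detect(r)
    history.append(df["Languagereveiw"].tolist())

for snapshot in history:
    print(snapshot)
```

Each snapshot holds one value repeated for all rows, and the last snapshot is whatever the final review was detected as. That would explain why a data set ending in an English review shows 'en' everywhere.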

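If you want to do this in a for loop, iterate with items() (named iteritems() before pandas 2.0, where it was removed) and write to one row at a time:

for index, text in df['Review'].items():
    lang = detect(text)  # detect the language of this row only
    df.loc[index, 'Languagereveiw'] = lang  # write to this row, not the whole column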

However, you can drop the loop entirely and just do:

df['Languagereveiw'] = df['Review'].apply(detect)

apply calls your function on each value in the column and returns the results as a new column, so no explicit loop is needed.
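
One thing to watch with apply on data read from a spreadsheet is empty cells. The sketch below wraps the detector in a small guard that skips missing or blank values. Both `safe_detect` and `fake_detect` are hypothetical names introduced for illustration; `fake_detect` is a crude stand-in so the example runs without langdetect, and in real use you would pass `langdetect.detect` as the `detector` argument.

```python
import pandas as pd

def fake_detect(text):
    """Crude stand-in for langdetect.detect (assumption, not real detection)."""
    return "en" if text.isascii() else "non-en"

def safe_detect(value, detector=fake_detect):
    """Return a language code, or None for missing or blank cells."""
    if not isinstance(value, str) or not value.strip():
        return None
    return detector(value)

df = pd.DataFrame({"Review": ["This is a park", None, "   ", "的马来西亚"]})

# apply calls safe_detect once per row and builds the new column.
df["Languagereveiw"] = df["Review"].apply(safe_detect)
print(df)
```

Rows with no usable text end up as missing values in the new column instead of stopping the whole run.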

Regarding your second question, about converting a language code to its full name ('en' to 'English'): take a look at polyglot. It can detect the language and give you both the language code and the full name.
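
If you only need names for a handful of codes, a plain dictionary lookup may be enough. Note that this is not polyglot, just a simple alternative sketch. The `LANGUAGE_NAMES` table and the `LanguageName` column are names made up for this example, and the table covers only the five codes shown in the question.

```python
import pandas as pd

# Hand-written table for the codes that appear in the question's output.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "zh-cn": "Chinese (Simplified)",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
}

df = pd.DataFrame({"Languagereveiw": ["ar", "zh-cn", "da", "nl", "en", "xx"]})

# map looks each code up in the dict; codes missing from the table
# become NaN, which fillna replaces with a placeholder.
df["LanguageName"] = df["Languagereveiw"].map(LANGUAGE_NAMES).fillna("unknown")
print(df)
```

The table has to be extended by hand for every additional code you expect, which is the trade-off against using a library that ships the full list.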
