Fast way of checking for language in csv

Problem Description

I have a huge csv-file, and one column has rows with summaries in different languages. My goal is to sort out those paragraphs that are not written in English. I don't mind if some words get sorted wrong.

My current code is working, but as I'm still a beginner I fear it's not really up to speed. It takes very long this way, and as I have about 80k rows I guess I'd still be sitting here waiting next week. I've checked for solutions but didn't find anything that worked for me, since the language-detection approaches I found seemed to be meant for small amounts of data.

import csv  # needed for csv.QUOTE_NONE below
import langdetect
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize


test = pd.read_csv("file.csv", sep='\t', header=0, index_col=False,
                   quoting=csv.QUOTE_NONE, usecols=("TI", "AB", "PY", "DI"), dtype=str)

stop_e = stopwords.words('english')
worte = gutenberg.words()

for line in test["AB"]:
    if type(line) == str:
        tokens = word_tokenize(line)
        for token in tokens:
            # flag alphabetic tokens that are neither stopwords nor in the English corpus
            if token.isalpha() and token not in stop_e and token not in worte:
                print(token)  # currently just printing to check the filter

After this I'm currently just printing stuff to check if my code is working so far.

Edit: This is already faster, since I skip rows that are purely English. But as was pointed out in the comments, I'm still deleting word by word, as I don't know how to remove entire paragraphs.

from langdetect import detect

for line in alle["AB"]:
    if type(line) == str:
        if detect(line) == 'en':
            pass  # whole row is English, skip it
        else:
            tokens = word_tokenize(line)
            for token in tokens:
                if token.isalpha() and token not in stop_e and token not in worte:
                    pass  # del word

Do you have any ideas for improvement? I guess my problem is that every word is checked against the whole Gutenberg corpus... but is there a faster way to do this?

Using from nltk.corpus import words as the corpus instead of Gutenberg seems to be a bit faster, but not significantly.
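
One likely reason it stays slow either way: stopwords.words('english') and the corpus word lists are list-like sequences, so every token not in ... check scans them from the front. Converting them to sets once up front makes each membership test effectively constant-time. A minimal sketch of that change, reusing the question's variable names:

from nltk.corpus import stopwords, words

# build the lookups once; "token in set" is O(1),
# while "token in list" rescans the whole list for every token
stop_e = set(stopwords.words('english'))
worte = set(words.words())

print('summary' in worte)          # expected: True
print('zusammenfassung' in worte)  # expected: False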

Sample of my dataframe. The summaries in AB are all English here, but I want to throw out any German/Spanish/other ones that made it into the csv.

TI  AB  PY  DI
83009   Disability and inclusive education in times of...   When communities fall into decline, disabled p...   2014    10.1080/01425692.2014.919845
83010   Transforming marginalised adult learners' view...   Adult learners on Access to Higher Education c...   2014    10.1080/01425692.2014.919842
83011   Home education, school, Travellers and educati...   The difficulties Traveller pupils experience i...   2014    10.1080/01425692.2014.919840
83012   Promoting online deliberation quality: cogniti...   This research aims to contribute to the theory...   2014    10.1080/1369118X.2014.899610
83013   Living in an age of online incivility: examini...   Communication scholars have examined the poten...   2014    10.1080/1369118X.2014.899609

Recommended Answer

From the comments, you mentioned you want to remove the entire paragraph. Here is how I would handle it. In your first code snippet you import langdetect but never actually use it. langdetect.detect() can take an entire string; you do not need to split it into words. Example:

langdetect.detect('using as example')
# output: 'en'

By not splitting the entire string into single words, this will cut down on the time, because detect() is not called for each word. Here is a small sample of how I would tackle it:

import pandas as pd
import langdetect

# creating a sample dataframe
df1 = pd.DataFrame({'Sentence': ['es muy bueno',
                                 'run, Forest! Run!',
                                 'Ήξερα ότι θα εξετάζατε τον Μεταφραστή Google',
                                 'This is Certainly en']})
# calling detect once per full sentence
df1['Language'] = df1['Sentence'].apply(lambda x: langdetect.detect(x))
# filtering the entire dataset for english only
filtered_for_english = df1.loc[df1['Language'] == 'en']
print(filtered_for_english)
# output:
#                Sentence Language
# 3  This is Certainly en       en

But here is the downside to using langdetect: according to the docs, it is a port of Google's language-detection library from Java to Python, and language detection is not always correct:

Look at the popular phrase from the English movie Forrest Gump passed through langdetect.detect('run, Forest! Run!'). This returns ro, for Romanian. You can try removing punctuation and stop-words, stemming and lemmatizing, or simply removing problematic nouns/verbs to get a more accurate reading. These are things you will need to test yourself.
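
Putting that together for the question's dataframe, two practical details are worth guarding against: langdetect raises LangDetectException on strings it cannot classify (e.g. empty or purely numeric ones), and the AB column may contain NaN values, which is why the question checks type(line) == str. A minimal sketch under those assumptions (the safe_detect helper and the lang column name are illustrative, not part of the original code):

import csv
import pandas as pd
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; seeding makes runs repeatable

def safe_detect(text):
    # hypothetical helper: guard against NaN cells and undetectable strings
    if not isinstance(text, str) or not text.strip():
        return 'unknown'
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'

test = pd.read_csv("file.csv", sep='\t', header=0, index_col=False,
                   quoting=csv.QUOTE_NONE, usecols=("TI", "AB", "PY", "DI"), dtype=str)
test['lang'] = test['AB'].apply(safe_detect)
english_only = test[test['lang'] == 'en']  # keeps whole rows, dropping non-English paragraphs entirely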
