Fast way of checking for language in csv
Question
I have a huge csv-file and one column has rows with summaries in different languages. My goal is to sort out those paragraphs that are not written in english. I don't mind if some words get sorted wrong.
My current code is working, but as I'm still a beginner I fear it's not really up to speed. It takes very long this way, and as I have about 80k rows I guess I'd still be sitting here waiting next week. I've checked for solutions but didn't find anything that worked for me, since the langdetect examples I found seemed to be meant for small amounts of data.
import csv
import langdetect
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize

test = pd.read_csv("file.csv", sep='\t', header=0, index_col=False,
                   quoting=csv.QUOTE_NONE, usecols=("TI", "AB", "PY", "DI"),
                   dtype=str)
stop_e = stopwords.words('english')
worte = gutenberg.words()

for line in test["AB"]:
    if type(line) == str:
        tokens = word_tokenize(line)
        for token in tokens:
            if token.isalpha() and token not in stop_e and token not in worte:
                print(token)  # currently just printing to check
After this I'm currently just printing stuff to check if my code is working so far.
Edit. This is faster already, since I skip rows that are purely english. But as was pointed out in the comments: I'm still deleting by word, as I don't know how to remove entire paragraphs.
from langdetect import detect

for line in alle["AB"]:
    if type(line) == str:
        if detect(line) == 'en':
            pass
        else:
            tokens = word_tokenize(line)
            for token in tokens:
                if token.isalpha() and token not in stop_e and token not in worte:
                    pass  # del word
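Since the goal is to drop whole paragraphs rather than single words, one pattern is to classify each row once and keep or skip it as a unit. A minimal sketch of that idea, using a hypothetical detect_stub in place of langdetect.detect so it runs standalone (with the real library, keep the try/except, since langdetect raises a LangDetectException on strings it cannot classify, such as empty or purely numeric ones):

```python
def detect_stub(text):
    # hypothetical stand-in for langdetect.detect
    return 'en' if 'the' in text.lower().split() else 'xx'

rows = ["the cat sat on the mat",
        "es muy bueno",
        "the dog barked"]

english_rows = []
for line in rows:
    if isinstance(line, str):
        try:
            if detect_stub(line) == 'en':  # classify the whole paragraph once
                english_rows.append(line)
        except Exception:  # with langdetect: catch LangDetectException
            pass

print(english_rows)  # ['the cat sat on the mat', 'the dog barked']
```

The Spanish row is dropped in one step instead of token by token.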
Do you have any ideas for improvement? I guess my problem is that every word is checked against the whole Gutenberg corpus... but is there a faster way to do this?
Using from nltk.corpus import words as a corpus instead of Gutenberg seems to be a bit faster, but not significantly.
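On the speed question: gutenberg.words() behaves like a list, so `token not in worte` scans the whole corpus for every single token. Converting the vocabulary to a set once makes each lookup constant-time, which is likely the biggest single win regardless of which corpus is used. A sketch with a toy word list standing in for the corpus:

```python
# toy vocabulary standing in for set(gutenberg.words())
worte_list = ["the", "quick", "brown", "fox"]
worte_set = set(w.lower() for w in worte_list)  # build once, reuse everywhere

def unknown_tokens(tokens, vocab):
    # keep alphabetic tokens that are not in the vocabulary
    return [t for t in tokens if t.isalpha() and t.lower() not in vocab]

print(unknown_tokens(["fox", "zorro", "123", "Fuchs"], worte_set))
# ['zorro', 'Fuchs']
```

For the same reason, stop_e can be wrapped as set(stopwords.words('english')) in the original loop.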
Sample of my dataframe. The summaries in AB are all english here but I want to throw out any german/spanish/others that made it into the csv.
TI AB PY DI
83009 Disability and inclusive education in times of... When communities fall into decline, disabled p... 2014 10.1080/01425692.2014.919845
83010 Transforming marginalised adult learners' view... Adult learners on Access to Higher Education c... 2014 10.1080/01425692.2014.919842
83011 Home education, school, Travellers and educati... The difficulties Traveller pupils experience i... 2014 10.1080/01425692.2014.919840
83012 Promoting online deliberation quality: cogniti... This research aims to contribute to the theory... 2014 10.1080/1369118X.2014.899610
83013 Living in an age of online incivility: examini... Communication scholars have examined the poten... 2014 10.1080/1369118X.2014.899609
Answer
From the comments, you mentioned you wanted to remove the entire paragraph. Here is how I would handle it. In your first code snippet you import langdetect but do not actually use it. langdetect.detect() can take an entire string; you do not need to split it into words. Example:
langdetect.detect('using as example')
# output
'en'
By not splitting the entire string into single words, this will cut down on the time, since detect() is not called for each word. Here is a small sample of how I would tackle it:
import pandas as pd
import langdetect
# creating a sample dataframe
df1 = pd.DataFrame({'Sentence':['es muy bueno','run, Forest! Run!','Ήξερα ότι θα εξετάζατε τον Μεταφραστή Google', 'This is Certainly en']})
# calling detect on each sentence
df1['Language'] = df1['Sentence'].apply(lambda x: langdetect.detect(x))
# filtering the entire dataset for only english
filtered_for_english = df1.loc[df1['Language'] == 'en']
# output
Sentence Language
3 This is Certainly en en
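Applied to the question's dataframe, the same pattern keeps only the English abstracts and drops everything else. A sketch with a hypothetical detect_stub in place of langdetect.detect so it runs without the library installed:

```python
import pandas as pd

def detect_stub(text):
    # hypothetical stand-in for langdetect.detect
    return 'en' if 'the' in text.lower().split() else 'xx'

df = pd.DataFrame({'AB': ['the quick brown fox', 'es muy bueno']})
df['Language'] = df['AB'].apply(detect_stub)

# keep only rows detected as English, dropping every other language
english_only = df.loc[df['Language'] == 'en']
print(english_only['AB'].tolist())  # ['the quick brown fox']
```

With the real library, swap detect_stub for langdetect.detect; non-string cells (NaN) would need to be handled first, e.g. with dropna() on the AB column.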
But here is the downside to using langdetect... it is a port of Google's language detection from Java to Python, according to the docs. Language detectors are not always correct:
Look at the popular phrase from the English movie Forrest Gump passed through langdetect.detect('run, Forest! Run!'). This returns ro for Romanian. You can try removing punctuation and stop-words, stemming and lemmatizing, or simply removing problematic nouns/verbs to get a more accurate reading. These are things you will need to test yourself.
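The punctuation-stripping idea can be sketched with the standard library (clean is a hypothetical helper, not part of langdetect). Note also that langdetect's results are non-deterministic by default; its docs suggest setting DetectorFactory.seed = 0 to get reproducible output across runs:

```python
import string

def clean(text):
    # hypothetical helper: drop punctuation before language detection
    return text.translate(str.maketrans('', '', string.punctuation)).strip()

print(clean('run, Forest! Run!'))  # 'run Forest Run'
```

Whether this improves accuracy on short exclamations is something to measure on your own data.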