Removing the non-English data


Problem description

I have some non-English words/sentences in my data. I tokenized my text and tried filtering with nltk.corpus.words.words(), but it's not really helpful, as it also removes brand names and company names (NLTK, for example). I need a solid solution for this purpose.

Here is what I tried:

import nltk

def removeNonEnglishWordsFunct(x):
    # Keep a token if it is a dictionary word, or if it contains
    # any non-letter characters (numbers, punctuation, etc.)
    words = set(nltk.corpus.words.words())
    filteredSentence = " ".join(w for w in nltk.wordpunct_tokenize(x)
                                if w.lower() in words or not w.isalpha())
    return filteredSentence


string = "NLTK testing man Apple Confiz Burj Al Arab Copacabana Palace Wは比較的新しくてきれいなのですが Sheraton hotelは時々 NYらしい小さくて清潔感のない部屋"

res = removeNonEnglishWordsFunct(string)
Output: testing man Apple Al Palace

Expected output: NLTK testing man Apple Confiz Burj Al Arab Copacabana Palace Sheraton hotel

Answer

You are kind of asking for the impossible here: you want it to be 'smart'.

We can make guesses about the kind of thing you want, but all we can do is get closer to something that may be right; we will never handle every edge case.

For example, let's assume any word starting with a capital letter is an acronym:

def wordIsRomanChars(w):
    # True if the word starts with a capital letter and every character is
    # ASCII or a fullwidth Roman letter (U+FF21-U+FF3A, U+FF41-U+FF5A).
    return w[0].isupper() and all(ord(c) < 128
                                  or 65313 <= ord(c) <= 65338
                                  or 65345 <= ord(c) <= 65370
                                  for c in w)


def removeNonEnglishWordsFunc2(x):
    words = set(nltk.corpus.words.words())
    filteredSentence = " ".join(w for w in nltk.wordpunct_tokenize(x)
                                if w.lower() in words or not w.isalpha()
                                or wordIsRomanChars(w))
    return filteredSentence


string = "NLTK testing man Apple Confiz Burj Al Arab Copacabana Palace Wは比較的新しくてきれいなのですが Sheraton hotelは時々 NYらしい小さくて清潔感のない部屋"

res = removeNonEnglishWordsFunc2(string)
print(res)
Gives: NLTK testing man Apple Confiz Burj Al Arab Copacabana Palace Sheraton
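As an aside not in the original answer: instead of checking the fullwidth code-point ranges by hand, NFKC normalization folds fullwidth Roman letters into their ASCII equivalents, after which a plain ord(c) < 128 test suffices. A minimal sketch (the function name is my own):

```python
import unicodedata

# Fold fullwidth Roman letters (e.g. 'Ｗ', U+FF37) into ASCII; kana and
# kanji are left untouched because they have no compatibility mapping.
def foldFullwidth(text):
    return unicodedata.normalize("NFKC", text)

print(foldFullwidth("Ｗは比較的"))  # fullwidth W becomes ASCII 'W'
```

Note that NFKC also rewrites other compatibility characters (ligatures, superscripts), so only use it if those changes are acceptable for your data.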

This is a good start, but it doesn't find the 'hotel', as that is attached to non-Roman characters.

We can get round this by ignoring non-Roman characters:

def takeCharsUntilNotRoman(w):
    # Collect leading characters that are ASCII or fullwidth Roman letters,
    # stopping at the first character that is neither.
    result = []
    for c in w:
        if ord(c) < 128 or 65313 <= ord(c) <= 65338 or 65345 <= ord(c) <= 65370:
            result.append(c)
        else:
            break
    # Assume a word needs to be at least 2 chars long
    if len(result) > 1:
        return ''.join(result)
    return ''


def removeNonEnglishWordsFunc3(x):
    # No dictionary/acronym filter is needed here: takeCharsUntilNotRoman
    # already returns '' for tokens that do not start with Roman characters,
    # so we simply take each token's Roman prefix and drop empty results.
    filteredSentence = (takeCharsUntilNotRoman(w) for w in nltk.wordpunct_tokenize(x))
    return ' '.join(a for a in filteredSentence if a)

res = removeNonEnglishWordsFunc3(string)
print(res)
Gives: NLTK testing man Apple Confiz Burj Al Arab Copacabana Palace Sheraton hotel NY

This is closer to your suggested output, but it has included 'NY', as that was pulled out of a mixed Asian/Roman string. The logic could be tweaked further, but it is hard to know how without knowing exactly what you need. We include 'Al' as a valid string, so why should 'NY' not be included?

Another question you will want to ask yourself: would we want English words and acronyms in the middle of a mixed Asian/Roman string, rather than just those at the beginning?

I do not know the answer to that; you will have to tweak the above to arrive at something that suits your case.
