NLTK word_tokenize on French text is not working properly


Question

I'm trying to use NLTK word_tokenize on a text in French by using:

from nltk.tokenize import word_tokenize

txt = "Le télétravail n'aura pas d'effet sur ma vie"
print(word_tokenize(txt, language='french'))

It should print:

['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']

But I get:

['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie']

Does anyone know why it's not splitting tokens properly in French, and how to overcome this (and other potential issues) when doing NLP in French?

Answer

Looking at the source of word_tokenize reveals that the language argument is only used to determine how to split the input into sentences. For word-level tokenization, a (slightly modified) TreebankWordTokenizer is used, which works best for English input and contractions like don't. From nltk/tokenize/__init__.py:

_treebank_word_tokenizer = TreebankWordTokenizer()
# ... some modifications done
def word_tokenize(text, language='english', preserve_line=False):
    # ...
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
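The English bias of the Treebank tokenizer comes from its contraction rules. As a rough illustration (a simplified sketch, not the actual NLTK implementation, which applies a longer list of regex substitutions in nltk/tokenize/treebank.py), one of its rules splits an English "n't" off the preceding word, and no comparable rule exists for French elision:

```python
import re

def treebank_like(text):
    # simplified sketch of one Treebank contraction rule: split English
    # "n't" off the preceding word ("don't" -> "do n't"); French "n'aura"
    # matches no rule, so it stays a single token
    text = re.sub(r"(?i)\b(\w+)(n't)\b", r"\1 \2", text)
    return text.split()

print(treebank_like("I don't know"))  # ['I', 'do', "n't", 'know']
print(treebank_like("n'aura"))       # ["n'aura"]
```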

To get your desired output, you might want to consider using a different tokenizer, such as a RegexpTokenizer, as follows:

from nltk.tokenize import RegexpTokenizer

txt = "Le télétravail n'aura pas d'effet sur ma vie"
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(txt)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']

My knowledge of French is limited and this only solves the stated problem. For other cases you will have to adapt the pattern. You can also look at the implementation of the TreebankWordTokenizer for ideas for a more complex solution. Also keep in mind that this way you will need to split sentences beforehand, if necessary.
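For pattern-matching tokenization like this, RegexpTokenizer essentially applies the pattern with re.findall, so you can check a candidate pattern quickly with the standard library alone, without NLTK installed:

```python
import re

# same pattern as above: match d'/n'/l' elisions first, then words,
# then dollar amounts, then any remaining non-whitespace run
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokens = re.findall(pattern, "Le télétravail n'aura pas d'effet sur ma vie")
print(tokens)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
```

Because alternatives in a regex are tried left to right, the elision branch `[dnl]['´`]` wins over `\w+` at positions like "n'aura", which is exactly what produces the desired split.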
