Extracting Words using nltk from German Text


Question

I am trying to extract words from a German document. When I use the following method, as described in the NLTK tutorial, I fail to get the words with language-specific special characters.

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))

What should I do to get the list of words in the document?

An example with nltk.tokenize.WordPunctTokenizer() for the German phrase "Veränderungen über einen Walzer" looks like this:

In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")

Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']

In this example "ä" is treated as a delimiter, even though "ü" is not.
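The split can be reproduced without NLTK at all: WordPunctTokenizer is essentially the regex \w+|[^\w\s]+ applied to the input, and on a properly decoded Unicode string the umlauts match \w. A minimal sketch in modern Python 3, where string literals are Unicode by default:

```python
import re

# WordPunctTokenizer is essentially this regex: runs of word characters,
# or runs of non-word, non-space characters (punctuation).
tokens = re.findall(r"\w+|[^\w\s]+", "Veränderungen über einen Walzer")
print(tokens)  # ['Veränderungen', 'über', 'einen', 'Walzer']
```

So the tokenizer itself handles umlauts fine; the garbled output above means the input bytes were misinterpreted before tokenization.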

Answer

Call PlaintextCorpusReader with the parameter encoding='utf-8':

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

I see... you have two separate problems here:

a) Tokenization problem: When you test with a literal German string, you think you are entering Unicode. In fact, you are telling Python to take the bytes between the quotes and convert them into a Unicode string, but your bytes are being misinterpreted. Fix: add the following line at the very top of your source file.

# -*- coding: utf-8 -*-
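What was going wrong can be sketched in modern Python 3, where bytes and text are separate types: the two UTF-8 bytes of "ä" (0xC3 0xA4), read back under the wrong (Latin-1-style) decoding, become the two characters "Ã" and "¤", and the second of those is a currency sign rather than a letter, so the tokenizer splits on it:

```python
# "ä" encodes to two bytes in UTF-8 ...
utf8_bytes = "ä".encode("utf-8")      # b'\xc3\xa4'

# ... and misreading those bytes as Latin-1 yields two characters,
# the second of which ("¤") is not a word character.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Ã¤
```

That is exactly the u'Ver\xc3', u'\xa4' split seen in the question's output.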

All of a sudden your constants will be seen and tokenized correctly:

german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)

b) Second problem: It turns out that Text() does not use Unicode! If you pass it a Unicode string, it will try to convert it to a pure-ASCII string, which of course fails on non-ASCII input. Ugh.

Solution: My recommendation would be to avoid using nltk.Text entirely and work with the corpus readers directly. (This is in general a good idea: see nltk.Text's own documentation.)
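Much of what nltk.Text adds is convenience such as frequency counting, and for that, plain Python on the corpus reader's token list does the job. A minimal sketch with collections.Counter, using a hard-coded token list as a stand-in for ptcr.words(DocumentName):

```python
from collections import Counter

# Stand-in for ptcr.words(DocumentName): a list of Unicode tokens.
tokens = ["Veränderungen", "über", "einen", "Walzer", "über"]

freq = Counter(tokens)
print(freq.most_common(2))  # [('über', 2), ('Veränderungen', 1)]
```

Because the tokens stay Unicode throughout, no encoding round-trip is needed.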

But if you must use nltk.Text with German data, here's how: read your data properly so it can be tokenized, then "encode" your Unicode back to a list of str. For German, it's probably safest to just use the Latin-1 encoding, but utf-8 seems to work too.

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# Convert unicode to utf8-encoded str
coded = [ tok.encode('utf-8') for tok in ptcr.words(DocumentName) ]
words = nltk.Text(coded)

