Tokenizing unicode using nltk


Question

I have text files that use utf-8 encoding and contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer:

import nltk

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = nltk.word_tokenize(items)


Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']

The Punkt tokenizer seems to do better:

from nltk.tokenize import PunktWordTokenizer

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)


Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

There is still '\ufeff' before the first token that I can't figure out (not that I can't remove it). What am I doing wrong? Help greatly appreciated.

Answer

It's more likely that the \uFEFF char is part of the content read from the file; I doubt it was inserted by the tokeniser. \uFEFF at the beginning of a file is a deprecated form of the Byte Order Mark (BOM). If it appears anywhere else, it is treated as a zero-width no-break space.
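As a side note, if the BOM character has already made it into a decoded string, it can also be stripped by hand. This sketch is not from the original answer; it just illustrates the point on the sample string from the question:

```python
# Sketch: remove a leading U+FEFF (BOM) from an already-decoded string.
text = u'\ufeffm\xfcsli p\xf6\xf6k r\xe4\xe4k'  # sample string with a BOM
clean = text.lstrip(u'\ufeff')                  # drops only a leading BOM
```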

Was the file written by Microsoft Notepad? From the codecs module docs:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
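The quoted behaviour can be checked directly on a byte string; this small sketch (not part of the original answer) decodes the same bytes with both codecs:

```python
# 0xef 0xbb 0xbf is the UTF-8 encoding of U+FEFF, followed by 'müsli'.
data = b'\xef\xbb\xbf' + u'm\xfcsli'.encode('utf-8')
with_bom = data.decode('utf-8')     # plain utf-8 keeps the BOM character
no_bom = data.decode('utf-8-sig')   # utf-8-sig strips it
```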

Try reading your file using codecs.open(). Note the "utf-8-sig" encoding, which consumes the BOM.

import codecs
import nltk

f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()
f.close()
a = nltk.word_tokenize(text)

An experiment:

>>> open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
>>> import codecs
>>> codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
>>> 
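On Python 3, where str is already Unicode, the same effect comes from the built-in open() with an encoding argument; a minimal sketch (not from the original answer), using a throwaway 'x.txt':

```python
# Write a file with a BOM, then read it back with the BOM consumed.
with open('x.txt', 'w', encoding='utf-8-sig') as f:  # prepends the BOM bytes
    f.write('m\xfcsli')
with open('x.txt', 'r', encoding='utf-8-sig') as f:  # strips the BOM on read
    text = f.read()
```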
