如何调整NLTK句子标记器 [英] How to tweak the NLTK sentence tokenizer

查看:97
本文介绍了如何调整NLTK句子标记器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用NLTK来分析一些经典文本,并且遇到了麻烦来逐句标记文本的问题.例如,这是我从 Moby Dick 摘录的内容:

I'm using NLTK to analyze a few classic texts and I'm running in to trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print "\n-----\n".join(sent_tokenize.tokenize(sample))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''

考虑到梅尔维尔(Melville)的语法有些陈旧,我不希望在这里达到完美,但是NLTK应该能够处理终端双引号和标题,例如"Mrs".但是,由于令牌生成器是无监督训练算法的结果,所以我不知道该如何修改.

I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algo, however, I can't figure out how to tinker with it.

有人建议使用更好的句子标记器吗?我更喜欢可以破解的简单启发式方法,而不必训练自己的解析器.

Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.

推荐答案

您需要向令牌生成器提供缩写列表,如下所示:

You need to supply a list of abbreviations to the tokenizer, like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

句子现在是

['is THAT what you mean, Mrs. Hussey?']

更新:如果句子的最后一个单词带有单引号或引号(例如 Hussey?'),则此方法不起作用.因此,一种快速而又肮脏的方法是在撇号和引号之前加上空格,然后在引号后面加上句号(.!?):

Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')

这篇关于如何调整NLTK句子标记器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆