Does the NLTK sentence tokenizer assume correct punctuation and spacing?


Problem description


I'm trying to split sentences using NLTK, and I've noticed that it treats two sentences with no whitespace between them as one sentence. For instance:

from nltk.tokenize import sent_tokenize

text = 'Today is Monday.I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.I went shopping.

text = 'Today is Monday. I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.
# 2) I went shopping.


Is there a way to properly split mispunctuated/misspaced sentences?

Recommended answer


While sentence segmentation is not very complicated for most western languages, as you've noticed it still goes wrong every now and then. There are several tools for this (opennlp and corenlp both have their own modules, for example); sent_tokenize from nltk is fairly rudimentary: it uses the pre-trained Punkt model, which won't split on a period that has no whitespace after it. You can 'repair' your output with something like the following:

import re

s = 'Today is Monday.I went shopping.Tomorrow is Tuesday.'

# Find every spot where a word character, a dot, and another word
# character run together, i.e. a sentence boundary missing its space.
slices = []
for match in re.finditer(r'\w\.\w', s):
    slices.append(match.start() + 2)  # index just after the dot
slices.append(len(s))

# Cut the string at each recorded position.
offset = 0
subsentences = []
for pos in sorted(slices):
    subsent = s[offset:pos]
    offset += len(subsent)
    subsentences.append(subsent)
print(subsentences)
# ['Today is Monday.', 'I went shopping.', 'Tomorrow is Tuesday.']


This splits the string wherever a word character is followed by a dot and another word character. Mind that word characters include digits, so you may want to change \w to [a-zA-Z] or similar, and perhaps also change the . to match any sentence-final punctuation character.
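
For instance, here is a minimal sketch of that stricter variant; the [A-Za-z] classes and the .!? punctuation set are assumptions you may want to adjust. It re-inserts the missing space and hands the repaired text back to sent_tokenize:

import re
from nltk.tokenize import sent_tokenize

def repair_spacing(text):
    # Insert a space after sentence-final punctuation (., ! or ?) that is
    # directly followed by a letter. Requiring letters on both sides keeps
    # decimals like 3.14 intact; both character classes are assumptions.
    return re.sub(r'([A-Za-z][.!?])([A-Za-z])', r'\1 \2', text)

text = 'Today is Monday.I went shopping.Tomorrow is Tuesday.'
print(sent_tokenize(repair_spacing(text)))
# ['Today is Monday.', 'I went shopping.', 'Tomorrow is Tuesday.']

This way the regex only fixes the spacing, and the actual boundary detection is still left to sent_tokenize.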

