NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

Problem description

I have 2 sentences in my dataset:

w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

When I use the NLTK tokenizers (both word and sentence), NLTK cannot split cat.I.

Here is the word tokenization:

>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenization:

>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']

How can I fix this? That is, how can I make NLTK tokenize w1 the same way as w2, given that words and punctuation are sometimes stuck together in my dataset?

Update: I tried Stanford CoreNLP 3.7.0, and it also cannot split 'cat.I' into 'cat', '.', 'I':

meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.

Solution

It's implemented this way on purpose: a period with no space after it usually doesn't mark the end of a sentence (think of the periods in "version 4.3", "i.e.", "A.M.", etc.). If sentence ends with no space after the full stop are common in your corpus, you'll have to preprocess the text with a regular expression or similar before sending it to NLTK.

A good rule of thumb is that a lowercase letter followed by a period followed by an uppercase letter usually marks the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.

import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)
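As a slightly fuller sketch of the same idea, the substitution can be wrapped in a small helper and checked against the counterexamples mentioned above. The function name add_space_after_period and the sample strings are made up for illustration; the pattern uses lookarounds instead of capture groups but matches the same lowercase-period-uppercase sequence.

```python
import re

# A period sitting directly between a lowercase and an uppercase letter.
# Assumption (heuristic): this marks a sentence end with a missing space.
_MISSING_SPACE = re.compile(r'(?<=[a-z])\.(?=[A-Z])')


def add_space_after_period(text):
    """Insert a space after each period matched by the heuristic above."""
    return _MISSING_SPACE.sub('. ', text)


samples = [
    "I am Pusheen the cat.I am so cute.",          # should be split
    "Use version 4.3 at 10 A.M., i.e. tomorrow.",  # should stay unchanged
]
for s in samples:
    print(add_space_after_period(s))
```

The cleaned string can then be passed to nltk.sent_tokenize or nltk.word_tokenize as before. Note that the heuristic also fires on strings such as "foo.Bar" that are not sentence boundaries, so it is worth spot-checking on a sample of the corpus.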
