NLTK Sentence Tokenizer, custom sentence starters


Problem description


I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains listings starting with bullet points, but they are not recognized as new sentences. I tried to add some parameters but that didn't work. Is there another way?

Here is some example code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)

>>> tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']

Solution

You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars = BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")

This will result in the following output:

['•', 'I am a sentence •', 'I am another sentence']

However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:

I introduce a list of sentences.

  • I am sentence one
  • I am sentence two

And I am one, too!

Would, depending on the details of your text, result in something like the following:

>>> tokenizer.tokenize("""
Look at these sentences:

• I am sentence one
• I am sentence two

But I am one, too!
""")

['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']

One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of simply employing something like a multi-delimiter split function, is that it is able to learn how to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.

There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet-point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for. How this might be achieved in detail depends on how exactly this kind of markup is used in the text.
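As a minimal sketch of such a preprocessor, assuming each • introduces a new sentence and the bullet character itself should be dropped (the helper name split_bullets is illustrative, not part of NLTK):

```python
import re

def split_bullets(text):
    """Split text at bullet points, treating each bullet item as its own segment.

    Each '•' starts a new segment; the bullet character and surrounding
    whitespace are stripped. Any text before the first bullet is kept
    as its own segment.
    """
    segments = re.split(r'\s*•\s*', text)
    return [s.strip() for s in segments if s.strip()]

print(split_bullets("Look at these sentences: • I am sentence one • I am sentence two"))
# → ['Look at these sentences:', 'I am sentence one', 'I am sentence two']
```

The resulting segments could then be passed individually through PunktSentenceTokenizer, so that bullets delimit items while ordinary punctuation within each segment is still handled by Punkt.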
