NLTK Sentence Tokenizer, custom sentence starters
Question
I'm trying to split a text into sentences with the PunktSentenceTokenizer
from nltk. The text contains listings starting with bullet points, but they are not recognized as new sentences. I tried to add some parameters but that didn't work. Is there another way?
Here is some example code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)
tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']
Answer
You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars=BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")
This will result in the following output:
['•', 'I am a sentence •', 'I am another sentence']
However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:
I introduce a list of sentences.
• I am sentence one
• I am sentence two
And I am one, too!
Would, depending on the details of your text, result in something like the following:
>>> tokenizer.tokenize("""
Look at these sentences:
• I am sentence one
• I am sentence two
But I am one, too!
""")
['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
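If you do take the sent_end_chars route, the misplaced bullets can be repaired in a small post-processing pass. The following is only a sketch (the helper name and the assumption that every stray bullet belongs to the next chunk are mine, not part of the original answer):

```python
def shift_bullets(chunks, bullet='•'):
    """Move a bullet that Punkt left dangling at the end of one chunk
    to the start of the following chunk, where it logically belongs."""
    fixed = []
    carry = ''  # bullet carried over from the previous chunk
    for chunk in chunks:
        chunk = carry + chunk.strip()
        carry = ''
        if chunk.endswith(bullet):
            # strip the trailing bullet and hand it to the next chunk
            chunk = chunk[:-len(bullet)].strip()
            carry = bullet + ' '
        if chunk:
            fixed.append(chunk)
    if carry.strip():
        fixed.append(carry.strip())
    return fixed

chunks = ['\nLook at these sentences:\n\n•',
          'I am sentence one\n•',
          'I am sentence two\n\nBut I am one, too!\n']
print(shift_bullets(chunks))
# → ['Look at these sentences:', '• I am sentence one', '• I am sentence two\n\nBut I am one, too!']
```

Note that this only relocates bullets; it does not split the last chunk, which still contains two sentences.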
One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of simply employing something like a multi-delimiter split function, is that it is able to learn how to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.
There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for.
How this might be achieved in detail is dependent on how exactly this kind of markup is used in the text.
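As one possible shape for such a preprocessor, here is a sketch using only the standard library (the function name, the fixed bullet character, and the choice to treat each bullet item as its own unit are all assumptions for illustration):

```python
import re

def split_bullet_items(text, bullet='•'):
    """Split text into a leading fragment plus one string per bullet item.
    Each bullet item is treated as its own unit; the resulting pieces can
    then be fed to a regular sentence tokenizer individually."""
    parts = re.split(re.escape(bullet), text)
    head = parts[0].strip()
    items = [p.strip() for p in parts[1:]]
    result = [head] if head else []
    # re-attach the bullet so the items still look like list entries
    result.extend(bullet + ' ' + item for item in items if item)
    return result

print(split_bullet_items(
    'Look at these sentences: • I am sentence one • I am sentence two'))
# → ['Look at these sentences:', '• I am sentence one', '• I am sentence two']
```

Each returned piece could then be passed to an ordinary PunktSentenceTokenizer, which no longer needs to know anything about bullets.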