Use of PunktSentenceTokenizer in NLTK


Question

I am learning Natural Language Processing using NLTK. I came across some code using PunktSentenceTokenizer whose actual use I cannot understand. The code is given:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A

tokenized = custom_sent_tokenizer.tokenize(sample_text)   #B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))


process_content()

So, why do we use PunktSentenceTokenizer? And what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why are two data sets needed to get part-of-speech tagging?

The lines marked A and B are what I am not able to understand.

PS: I did try to look in the NLTK book, but could not understand what the real use of PunktSentenceTokenizer is.

Solution

PunktSentenceTokenizer is the class behind the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
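This is also what lines A and B in the question do: the constructor runs Punkt's unsupervised training procedure on the raw text it is given, and tokenize() then applies the learned parameters to new text. A minimal sketch with short hypothetical stand-in strings (the question's code trains on the full 2005 speech instead):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training text, standing in for state_union.raw("2005-GWBush.txt").
train_text = (
    "We will reform our schools. We will lower taxes. "
    "We will defend this nation. We will keep our promises."
)

# Line A: the constructor trains a Punkt model on the raw text, learning
# abbreviations, collocations, and sentence starters from it (unsupervised).
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# Line B: the trained model is then applied to a *different* text.
sample_text = "Tonight I will set forth policies. They will advance that ideal."
tokenized = custom_sent_tokenizer.tokenize(sample_text)
print(tokenized)
```

So the two data sets are not both needed for part-of-speech tagging as such: the training text only fits the sentence splitter, and the sample text is what actually gets split (and later POS-tagged sentence by sentence).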

Given a paragraph with multiple sentences, e.g.:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print sent
...     print '--------'
... 
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------
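Besides returning the sentence strings, the underlying PunktSentenceTokenizer can also report character offsets via its span_tokenize() method. A minimal sketch; an untrained tokenizer falls back on Punkt's built-in heuristics, so no pickled model is needed here:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = "First sentence. Second sentence."

# No training text: the tokenizer uses Punkt's default decision rules.
tokenizer = PunktSentenceTokenizer()

# span_tokenize() yields (start, end) slice offsets into the original string.
spans = list(tokenizer.span_tokenize(text))
print(spans)
```

The offsets are useful when you need to map sentences back to positions in the source document rather than work with detached strings.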

sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of languages with pre-trained models available in NLTK is:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README

Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print sent
...     print '---------'
... 
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------

To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the training data format for nltk punkt.
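The training API can be sketched as follows. The corpus string here is a tiny hypothetical stand-in; a usable model needs a large amount of text in the target language:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical stand-in corpus; real training needs far more text.
corpus = (
    "Mr. Smith arrived early. Mr. Jones was late. "
    "Mr. Brown never came at all. The meeting started anyway."
)

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # also collect collocation statistics
trainer.train(corpus, finalize=False)
trainer.finalize_training()

# Build a tokenizer from the learned parameters (abbreviations etc.).
tokenizer = PunktSentenceTokenizer(trainer.get_params())

sents = tokenizer.tokenize("The agenda was long. It ran past noon.")
print(sents)
```

The resulting parameters object can also be pickled and reloaded later, which is how the shipped language models above were produced.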
