Training data format for NLTK punkt


Question

I would like to run the NLTK Punkt tokenizer to split text into sentences. There is no pre-trained model for my data, so I am training a model separately, but I am not sure whether the training data format I am using is correct.

My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light on the training data format.

What is the correct training data format for the NLTK Punkt sentence tokenizer?

Answer

Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detector. And the authors' last names are pretty cool too: Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent).

To train a new model, simply use:

import nltk.tokenize.punkt
import pickle
import codecs

# Create an untrained Punkt sentence tokenizer.
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read the training corpus: plain, unannotated text in a consistent encoding.
with codecs.open("someplain.txt", "r", "utf8") as fin:
    text = fin.read()

# Train the unsupervised model on the raw text.
tokenizer.train(text)

# Pickle the trained tokenizer for later reuse.
with open("someplain.pk", "wb") as fout:
    pickle.dump(tokenizer, fout)
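Once the pickle is written, the tokenizer can be loaded back and used to split raw text. A minimal sketch (the file name matches the snippet above; the example sentence is just for illustration):

import pickle

# Load the tokenizer trained above.
with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)

# Split raw text into sentences using the learned parameters.
for sentence in tokenizer.tokenize(u"Dr. Smith arrived at 5 p.m. He was late."):
    print(sentence)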

To achieve higher precision, and to allow you to stop training at any time while still saving a proper pickle for your tokenizer, look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :

import codecs
import pickle
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

def train_punktsent(trainfile, modelfile):
    """Trains an unsupervised NLTK punkt sentence tokenizer."""
    punkt = PunktTrainer()
    try:
        # Train incrementally without finalizing, so an interrupted run
        # still yields a usable model.
        with codecs.open(trainfile, 'r', 'utf8') as fin:
            punkt.train(fin.read(), finalize=False, verbose=False)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')
    ##HACK: Adds abbreviations from rb_tokenizer.
    abbrv_sent = " ".join([i.strip() for i in
                           codecs.open('abbrev.lex', 'r', 'utf8').readlines()])
    abbrv_sent = "Start" + abbrv_sent + "End."
    punkt.train(abbrv_sent, finalize=False, verbose=False)
    # Finalize and output the trained model.
    punkt.finalize_training(verbose=True)
    model = PunktSentenceTokenizer(punkt.get_params())
    with open(modelfile, mode='wb') as fout:
        pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
    return model
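A minimal usage sketch, assuming a plain-text German corpus german_corpus.txt and an abbrev.lex abbreviation list sit next to the script (both file names are illustrative):

model = train_punktsent('german_corpus.txt', 'german.pk')
print(model.tokenize(u"Dr. Müller kam um 5 Uhr. Er war spät dran."))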

However, do note that the period detection is very sensitive to the Latin full stop, question mark and exclamation mark. If you're going to train a Punkt tokenizer for other languages that don't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of Punkt, edit the sent_end_chars variable; one way to do that without patching the library source is sketched below.
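A minimal sketch of overriding sent_end_chars by subclassing PunktLanguageVars, assuming a recent NLTK version where the internal regexes are derived from sent_end_chars at run time (in older versions you may indeed need to edit the module source directly); the Armenian full stop is just an illustrative example:

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class MyLanguageVars(PunktLanguageVars):
    # Hypothetical example: also treat the Armenian full stop (U+0589)
    # as a sentence boundary, in addition to the Latin defaults.
    sent_end_chars = ('.', '?', '!', '\u0589')

tokenizer = PunktSentenceTokenizer(lang_vars=MyLanguageVars())
# Then train on plain, unannotated text as before:
# tokenizer.train(text)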

There are pre-trained models available other than the 'default' English tokenizer used by nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt

Note: those pre-trained models are currently not available, because the nltk_data GitHub repo linked above has been removed.
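NLTK's own data distribution still ships language-specific Punkt models, fetched with nltk.download('punkt'). A minimal sketch of loading the German model that way (assuming the punkt data package is installed):

import nltk

nltk.download('punkt')  # fetch the pre-trained Punkt models once

# Load a language-specific model directly...
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_tokenizer.tokenize(u"Dr. Müller kam um 5 Uhr. Er war spät dran."))

# ...or let sent_tokenize pick the model by language name.
print(nltk.tokenize.sent_tokenize(u"Dr. Müller kam um 5 Uhr.", language='german'))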
