NLP - Sentence Segmentation
Question
I am a newbie trying my hand at sentence segmentation in NLP. I am aware that tokenizers for this are available in NLTK, but I want to build my own sentence segmenter using a machine learning algorithm such as a decision tree. However, I am not able to gather training data for it. What should the data look like? How should it be labelled, since I want to try supervised learning first? Is there any sample data already available? Any help would be useful. I have searched the net for nearly a week and am now posting here for help. Thanks in advance.
Recommended answer
As far as I know, sentence splitters are typically implemented as a hybrid of a set of rules (the punctuation characters to consider) and some automatically learnt weights (for exceptions, such as abbreviations with a period, which don't act as a full stop). The weights can be learnt without supervision.
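To make the rule part of that hybrid concrete, here is a minimal sketch. It is not how NLTK or any particular library does it: the function name rule_based_split and the ABBREVIATIONS set are made up for illustration, and the exceptions are hard-coded here, whereas a real splitter would learn them from data.

```python
# Hypothetical, hand-written exception list. A real splitter would learn
# these exceptions automatically instead of hard-coding them.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}


def rule_based_split(text):
    """Split at '.', '!' or '?' followed by whitespace (or end of text),
    unless the token ending in '.' is a known abbreviation."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        at_end = i + 1 == len(text)
        if not at_end and not text[i + 1].isspace():
            continue  # punctuation inside a token, e.g. "3.14"
        # Last whitespace-delimited token that ends at this punctuation mark.
        token = text[start:i + 1].split()[-1].lower()
        if ch == "." and token in ABBREVIATIONS:
            continue  # abbreviation, not a sentence end
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(rule_based_split("This is it! I'm leaving Dr. Smush in his box."))
```

The period after "Dr" is skipped because of the exception list, so the text comes out as two sentences.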
It's an interesting idea, however, to approach this with a plain ML-based system. For a supervised scheme, you could try a character-based sequence-labelling model with BIO labels. For example, your training data could look like this:
This is it! I'm leaving Dr. Smush in his box.
BIIIIIIIIIIOBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The predicted output will then also be BIIIIO..., and you'll have to split the original text at the characters labelled O.
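Splitting at the O characters can be sketched roughly like this. The function name split_at_o is just an illustrative choice, and the labels are assumed to be a string with exactly one B, I or O per character of the text.

```python
def split_at_o(text, labels):
    """Cut `text` into sentences, dropping every character labelled 'O'."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    sentences = []
    current = []
    for ch, label in zip(text, labels):
        if label == "O":
            # An O character separates sentences: close the current one.
            if current:
                sentences.append("".join(current))
                current = []
        else:
            current.append(ch)  # B or I: the character belongs to a sentence
    if current:
        sentences.append("".join(current))
    return sentences


text = "Hi there. Bye!"
# Labels built programmatically so they stay aligned with the text.
labels = "B" + "I" * 8 + "O" + "B" + "I" * 3
print(split_at_o(text, labels))
```

This version only looks at O. A stricter variant could also start a new sentence at every B, which would help if the model predicts a B without a preceding O.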
I'm not sure if this is the best approach, but if you try it, let me know if it's any good. Make sure you use n-grams of high orders (3-, 4-, 5-grams or even higher), since these are characters, not word tokens.
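As a rough idea of what such character n-gram features could look like, here is a small sketch. The name char_ngram_features, the feature names and the "#" padding character are all assumptions made for this example (the padding would clash with a literal "#" in the text, so pick something else if your data contains it).

```python
def char_ngram_features(text, i, max_n=5):
    """Character n-grams (n = 1..max_n) just before and starting at position i."""
    pad = "#" * max_n  # assumed padding so positions near the edges still work
    padded = pad + text + pad
    j = i + max_n  # index of text[i] inside the padded string
    features = {}
    for n in range(1, max_n + 1):
        features["left_%d" % n] = padded[j - n:j]    # the n characters before position i
        features["right_%d" % n] = padded[j:j + n]   # the n characters starting at position i
    return features


# Features for the space that follows "Dr." in "Dr. Smush".
feats = char_ngram_features("Dr. Smush", 3)
print(feats["left_3"], "|", feats["right_3"])
```

Each position in the text gets one such dictionary and one B/I/O label. Dictionaries like this would still have to be turned into numeric vectors (for example with scikit-learn's DictVectorizer) before a decision tree can use them.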
As for the training data, you can use any linguistically annotated corpus, since they are all sentence-split (e.g., look at the ones included in NLTK). All you have to do is produce the BIO labels for training.
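Producing the labels could look something like the sketch below. It assumes the corpus has already been turned into a list of plain sentence strings, and that sentences are joined with a single space that gets the O label. The function name bio_labels is made up for this example.

```python
def bio_labels(sentences, separator=" "):
    """Join sentences into one text and build a parallel B/I/O label string."""
    text_parts = []
    label_parts = []
    for sentence in sentences:
        if not sentence:
            continue  # skip empty sentences
        if text_parts:
            # Characters between two sentences are labelled O.
            text_parts.append(separator)
            label_parts.append("O" * len(separator))
        text_parts.append(sentence)
        # First character of a sentence is B, all the others are I.
        label_parts.append("B" + "I" * (len(sentence) - 1))
    return "".join(text_parts), "".join(label_parts)


text, labels = bio_labels(["This is it!", "I'm leaving Dr. Smush in his box."])
print(text)
print(labels)
```

Note that many corpora store sentences as lists of tokens rather than raw strings, so you may have to rebuild the surface text (spacing around punctuation and so on) before labelling.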