Split text into sentences
Question
I wish to split text into sentences. Can anyone help me?
I also need to handle abbreviations. However, my plan is to replace these at an earlier stage: Mr. -> Mister
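That earlier replacement stage could be sketched as a simple table-driven substitution. This is only an illustration of the plan described above; the abbreviation table and function name are assumptions, not part of the original question:

```python
import re

# Hypothetical abbreviation table -- extend as needed.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor", "Prof.": "Professor"}

def expand_abbreviations(text):
    # Build one alternation pattern from the known abbreviations and
    # replace each match with its expanded form.
    pattern = re.compile("|".join(re.escape(a) for a in ABBREVIATIONS))
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(0)], text)

print(expand_abbreviations("Mr. Smith met Dr. Jones."))
# → Mister Smith met Doctor Jones.
```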
import re
import unittest

class Sentences:
    def __init__(self, text):
        # Split on sentence-ending punctuation followed by whitespace.
        # Note: re.split consumes the matched delimiter, so the
        # punctuation survives only on the final sentence.
        self.sentences = tuple(re.split(r"[.!?]\s", text))

class TestSentences(unittest.TestCase):
    def testFullStop(self):
        self.assertEqual(Sentences("X. X.").sentences, ("X.", "X."))

    def testQuestion(self):
        self.assertEqual(Sentences("X? X?").sentences, ("X?", "X?"))

    def testExclamation(self):
        self.assertEqual(Sentences("X! X!").sentences, ("X!", "X!"))

    def testMixed(self):
        self.assertEqual(Sentences("X! X? X! X.").sentences, ("X!", "X?", "X!", "X."))
Thanks, Barry
To start with, I would be happy to satisfy the four tests I've included above. This would help me better understand how regexes work. For now, I can define a sentence as "X." etc., as defined in my tests.
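For what it's worth, the four tests above can be satisfied by splitting on whitespace that is *preceded by* sentence-ending punctuation, using a lookbehind so the delimiter itself is not consumed. This is a sketch, not from the original question:

```python
import re

def split_sentences(text):
    # (?<=[.!?]) is a zero-width lookbehind: the split point is the
    # whitespace after sentence-ending punctuation, so the punctuation
    # stays attached to each sentence.
    return tuple(re.split(r"(?<=[.!?])\s+", text))

print(split_sentences("X! X? X! X."))  # → ('X!', 'X?', 'X!', 'X.')
```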
Answer
Sentence segmentation can be a very difficult task, especially when the text contains dotted abbreviations. It may require the use of a list of known abbreviations, or training a classifier to recognize them.
I suggest you use NLTK - a suite of open-source Python modules designed for natural language processing.
You can read about sentence segmentation using NLTK here, and decide for yourself whether this tool fits your needs.
EDIT: or, even simpler, here and here is the source code. This is the Punkt sentence tokenizer, included in NLTK.