Python - RegEx for splitting text into sentences (sentence-tokenizing)

Question

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, and not at decimals, abbreviations, titles in a name, or a period inside something like .com. This is my attempt at a regex, which doesn't work:

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
    print(stuff)

Example output, as it should look:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. 
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Recommended answer

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this: split your string on the pattern above. You can also check the demo:

http://regex101.com/r/nG1gU7/27
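
For reference, here is a minimal sketch of how the pattern from the answer could be used with `re.split` on the sample text from the question. The names `SENTENCE_BOUNDARY` and `split_sentences` are made up for this illustration and are not part of the original answer; the pattern itself is unchanged, only written in verbose mode so each lookbehind can carry a comment.

```python
import re

# Pattern from the answer, in verbose mode so each part can be annotated.
# SENTENCE_BOUNDARY and split_sentences are illustrative names (assumptions).
SENTENCE_BOUNDARY = re.compile(r"""
    (?<!\w\.\w.)        # not right after something shaped like i.e. or e.g.
    (?<![A-Z][a-z]\.)   # not right after a two-letter title such as Mr. or Jr.
    (?<=\.|\?)          # must directly follow a period or a question mark
    \s                  # the single whitespace character to split on
""", re.VERBOSE)


def split_sentences(text):
    """Split text at whitespace that follows sentence-ending punctuation."""
    # strip() avoids an empty trailing item when the text ends with a newline.
    return SENTENCE_BOUNDARY.split(text.strip())


text = (
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
    "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
    "isn't true... Well, with a probability of .9 it isn't."
)

for sentence in split_sentences(text):
    print(sentence)
```

On the sample text this yields the five sentences listed in the question. The pattern is a heuristic, though: as written it does not split after "!" (the last lookbehind only accepts "." or "?"), and a longer title such as "Mrs." is not protected, so the text would be split after it.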
