A regex for extracting sentences from a paragraph in Python


Question

I'm trying to extract sentences from a paragraph using regular expressions in Python.
The code I'm testing usually extracts the sentences correctly, but with the following paragraph it does not.

The paragraph:

"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine?

Code:

import re

def splitParagraphIntoSentences(paragraph):
    # A sentence ends at '.', '!' or '?' followed by one or two
    # whitespace characters and then an uppercase letter.
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)

When tested with the above paragraph, the output is exactly the same as the input paragraph, but it should look like this:

But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections

A new type of vaccine

Is there anything wrong with the regular expression?

Recommended answer

The paragraph you've posted as an example has its first sentence enclosed in double quotes ", and the closing quote comes immediately after the full stop: infections."

Your regexp [.!?]\s{1,2} looks for a sentence-ending punctuation mark (., ! or ?) followed directly by one or two whitespace characters. Here the full stop is followed by a quote rather than whitespace, so the terminator is never matched.
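As a quick check of that explanation, the sketch below runs the original pattern against the example paragraph and against a plain two-sentence string. The variable names and the second test string are made up for illustration.

```python
import re

# The paragraph from the question, quoted exactly as posted.
paragraph = ('"But in the case of malaria infections and sepsis, dendritic cells '
             'throughout the body are concentrated on alerting the immune system, '
             'which prevents them from detecting and responding to any new '
             'infections." A new type of vaccine?')

# The original pattern: terminator, 1-2 whitespace chars, then an uppercase letter.
original = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

# The full stop is followed by a quote, not whitespace, so nothing matches.
parts = original.split(paragraph)
print(len(parts))

# With no quote in the way, the same pattern splits as intended.
plain_parts = original.split('First one. Second one.')
print(plain_parts)
```

Because no terminator is matched in the first case, split returns a list holding the whole paragraph as its only element, which is why the output looked identical to the input.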

It can be adjusted to cope with this case by allowing for an optional closing quote:

sentenceEnders = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')
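A minimal sketch of how this adjusted pattern behaves, using a shortened, made-up sample string instead of the full paragraph:

```python
import re

# Adjusted pattern: an optional closing quote may follow the terminator.
sentence_enders = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')

# Hypothetical sample with the same shape as the paragraph in the question.
text = '"He said it was over." A new type of vaccine?'

parts = sentence_enders.split(text)
for part in parts:
    print(part)
```

The split now happens, but everything the pattern matched (the full stop, the closing quote and the space) is consumed, so the first piece comes back with neither its terminator nor its closing quote.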

However, with the above regexp you would be removing the closing quote from the sentence. Keeping it is slightly trickier and can be done using a look-behind assertion:

sentenceEnders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')
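The following sketch tries the look-behind version on a made-up sample string. It assumes Python 3.7 or later, where re.split also splits on empty matches. In earlier versions an empty match does not split, so a terminator followed by a single space would be left unsplit.

```python
import re

# Look-behind version: the terminator and quote are asserted, not consumed.
sentence_enders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')

# Hypothetical sample: one quoted sentence, then two ordinary ones.
text = '"He said it was over." A new type of vaccine? Perhaps.'

# Strip each piece, since whitespace matched by the look-behind stays attached.
parts = [p.strip() for p in sentence_enders.split(text)]
for part in parts:
    print(part)
```

Here each piece keeps its terminating punctuation, and the first keeps its closing quote, because only the optional whitespace after the look-behind is removed by the split.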

Note, however, that there are a lot of cases where a regexp-based splitter fails, e.g.:

  • Abbreviations: "In the works of Dr. A. B. Givental ..." According to your regexp, this will be incorrectly split after "Dr.", "A." and "B." (You can adjust for the single-letter case, but you cannot detect an abbreviation unless you hard-code it.)

  • Use of exclamation marks in the middle of the sentence: "... when, lo and behold! M. Deshayes himself appeared..."

  • Use of multiple quote marks and nested quotes, etc.
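To illustrate the abbreviation case, here is a rough sketch that reproduces the bad split and then merges pieces back together using a hard-coded list. The ABBREVIATIONS set, the helper names and the sample sentence are all hypothetical, and Python 3.7 or later is assumed for splitting on empty matches. This is one possible workaround, not a robust solution.

```python
import re

sentence_enders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')

# Hypothetical hard-coded list; a real one would need to be much longer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof."}

def ends_with_abbreviation(piece):
    """True if the last word is a known abbreviation or a single initial."""
    words = piece.split()
    if not words:
        return False
    last = words[-1]
    return last in ABBREVIATIONS or re.fullmatch(r'[A-Z]\.', last) is not None

def merge_abbreviations(pieces):
    """Re-join pieces that were split right after an abbreviation."""
    merged = []
    for piece in pieces:
        if merged and ends_with_abbreviation(merged[-1]):
            merged[-1] = merged[-1] + ' ' + piece
        else:
            merged.append(piece)
    return merged

text = 'In the works of Dr. A. B. Givental the idea recurs.'
naive = [p.strip() for p in sentence_enders.split(text)]
fixed = merge_abbreviations(naive)
print(naive)
print(fixed)
```

The merge step has its own false positives: a sentence that really does end in a single capital letter, such as "We chose plan B.", would be glued to the next one. That is the kind of ambiguity a pure regexp approach cannot resolve.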
