Sentence matching with regex

Views: 197. Published: 2020/7/11 0:26:45. Tags: python, regex, python-2.7, text-segmentation

Question

I have a text that spans many lines with no particular format, so I call line.strip('\n') on each line. I then want to split the text into sentences, using the period (.) as the sentence-end marker, under these rules:

1) A period . that is followed by \s (whitespace) or \S (such as " or ') and then by [A-Z] should split.
2) A period matching [0-9]\.[A-Za-z] should not split, as in 1.stackoverflow real time solution.

My program only handles half of rule 1: a period (.) that is followed by \s and [A-Z]. Here is the code:

# -*- coding: utf-8 -*-
import re, sys

source = open(sys.argv[1], 'rb')
dest = open(sys.argv[2], 'wb')
sent = []
for line in source:
    line1 = line.strip('\n')
    k = re.sub(r'\.\s+([A-Z"])'.decode('utf8'), '.\n\g<1>', line1)
    sent.append(k)

for line in sent:
    dest.write(''.join(line))

Please! I'd also like to know the best way to master regex. It seems confusing.

Recommended answer

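To include the single quote in the character class, escape it with a \. The escape is there because the pattern sits inside a single-quoted Python string; the regex engine itself treats ' as an ordinary character. The regex should be: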

\.\s+[A-Z"\']

That's really all you need. You only have to tell a regex what to match; you don't have to specify what you don't want to match. Anything that doesn't fit the pattern simply won't match.

This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
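As a rough illustration of that claim, the sketch below applies the pattern with re.sub, much like the code in the question. It is written for Python 3; the helper name mark_boundaries, the capture group, and the sample text are assumptions added for the example, not part of the original answer.

```python
import re

# Pattern from the answer, with a capture group (an addition for this sketch)
# so the first character of the next sentence can be put back.
BOUNDARY = re.compile(r'\.\s+([A-Z"\'])')

def mark_boundaries(text):
    # Replace ". <whitespace> X" with ".\nX", where X is the captured
    # capital letter or quote that starts the next sentence.
    return BOUNDARY.sub(r'.\n\1', text)

sample = 'See 1.stackoverflow real time solution. It works. "Really?" he asked.'
print(mark_boundaries(sample))
```

The period in 1.stackoverflow is left alone because it is not followed by whitespace, so no explicit exclusion rule is needed.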

This assumes the regex you had was already splitting on a period followed by whitespace followed by a capital, as you stated. Note, however, that a split consumes everything the pattern matches, so I am Sam. Sam I am. would split into I am Sam and am I am. Is that really what you want? If not, use zero-width assertions to exclude the parts you want to match but also keep. Here are your options, in order of what I think you most likely want.
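The consuming behaviour described above can be checked with a short Python 3 sketch; the variable names are illustrative only.

```python
import re

text = 'I am Sam. Sam I am.'

# The whole match (period, whitespace and the capital letter) is removed
# from the output, so the second piece loses its leading "S".
parts = re.split(r'\.\s+[A-Z"\']', text)
print(parts)
```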

1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:

(?<=\.)\s+(?=[A-Z"\'])

This will split the example above into I am Sam. and Sam I am.
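A minimal Python 3 sketch of option 1, assuming re.split is the splitting function in use; split_keep_periods is a hypothetical helper name.

```python
import re

def split_keep_periods(text):
    # Only the whitespace is consumed: the lookbehind requires a period
    # before it and the lookahead a capital letter or quote after it.
    return re.split(r'(?<=\.)\s+(?=[A-Z"\'])', text)

print(split_keep_periods('I am Sam. Sam I am.'))
print(split_keep_periods('Try 1.stackoverflow today. Done.'))
```

Because the lookbehind here is a single character, it is fixed-width and compiles under Python's re module.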

2) Keep the first letter of the next sentence; lose the period and whitespace:

\.\s+(?=[A-Z"\'])

This will split into I am Sam and Sam I am, presuming that more sentences follow; otherwise the period stays with the last sentence, because it is not followed by whitespace and a capital letter or quote. If this is the option you want (sentences without their periods), you may also want to match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace are dropped as well:

\.(?:\s+(?=[A-Z"\'])|\s*$)
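The following Python 3 sketch tries this pattern with re.split. One detail is specific to this sketch rather than the answer: when the pattern matches at the very end of the string, re.split returns an empty final element, so the hypothetical helper split_drop_periods filters empty strings out.

```python
import re

def split_drop_periods(text):
    # Split on a period followed by whitespace and a capital or quote,
    # or on a final period with optional trailing whitespace.
    parts = re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text)
    # A match at the end of the string leaves an empty last element.
    return [part for part in parts if part]

print(split_drop_periods('I am Sam. Sam I am. '))
```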

Note the ?:. You need non-capturing parentheses, because if a split pattern contains capture groups, whatever each group captures is added as an element of the result (e.g. split('(\+)', 'a+b+c') gives you a, +, b, +, c rather than just a, b, c).
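The difference between the two kinds of parentheses is easy to see with Python's re.split (Python 3 shown; the variable names are illustrative):

```python
import re

# Capturing group: the matched separators appear in the result.
with_group = re.split(r'(\+)', 'a+b+c')

# Non-capturing group: only the pieces between separators are returned.
without_group = re.split(r'(?:\+)', 'a+b+c')

print(with_group)
print(without_group)
```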

3) Keep everything; whitespace goes with the preceding sentence:

(?<=\.\s+)(?=[A-Z"\'])

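This will give you I am Sam. (with its trailing space) and Sam I am. Note that this pattern needs a regex engine with variable-width lookbehind; Python's re module only accepts fixed-width lookbehind and will refuse to compile (?<=\.\s+).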
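In Python's re module, the same result can be obtained without any lookbehind. The sketch below is one possible workaround, not the answer's own method: it marks each boundary with a sentinel character and then splits on it. The helper name split_keep_everything and the choice of '\x00' as sentinel are assumptions; the sentinel must not occur in the input.

```python
import re

SENTINEL = '\x00'  # assumption: this character never appears in the text

def split_keep_everything(text):
    # Append the sentinel after each period-plus-whitespace run that is
    # followed by a capital letter or quote, then split on the sentinel.
    marked = re.sub(r'\.\s+(?=[A-Z"\'])',
                    lambda m: m.group(0) + SENTINEL,
                    text)
    return marked.split(SENTINEL)

pieces = split_keep_everything('I am Sam.  Sam I am.')
print(pieces)
```

Nothing is discarded, so joining the pieces back together reproduces the input.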

Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html
