Sentence matching with regex

Question

I have a text that is broken across many lines with no particular format, so I decided to call line.strip('\n') on each line. I then want to split the text into sentences using the sentence end marker (.), with these rules:

  1. A period . followed by \s (whitespace), a \S character (like " or '), and then [A-Z] should split.
  2. [0-9]\.[A-Za-z] should not split, as in 1.stackoverflow real time solution.

My program only solves half of rule 1: a period (.) followed by \s and [A-Z]. Below is the code:

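# -*- coding: utf-8 -*-
import re
import sys

sent = []
# Read the input as text (UTF-8 assumed) so the str pattern below applies.
with open(sys.argv[1], encoding='utf-8') as source:
    for line in source:
        line1 = line.strip('\n')
        # Break after a period that is followed by whitespace and [A-Z"].
        k = re.sub(r'\.\s+([A-Z"])', r'.\n\g<1>', line1)
        sent.append(k)

with open(sys.argv[2], 'w', encoding='utf-8') as dest:
    for line in sent:
        dest.write(line)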

Please! I'd also like to know the best way to master regex. It seems confusing.

Recommended answer

To include the single quote in the character class, escape it with a \ (this matters when the pattern is written inside a single-quoted string literal; the regex engine itself treats \' as a plain '). The regex should be:

\.\s+[A-Z"\']

That's really all you need. You only need to tell a regex what to match; you don't need to specify what you don't want to match. Anything that doesn't fit the pattern won't match.

This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
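As a minimal sketch of that claim, the corrected character class can be dropped into a substitution like the one in the question. The helper name break_sentences and the sample text are assumptions made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer, with the next character captured so it can be kept.
BOUNDARY = re.compile(r'\.\s+([A-Z"\'])')

def break_sentences(text):
    """Put a newline after each sentence-ending period (hypothetical helper)."""
    return BOUNDARY.sub(r'.\n\1', text)

sample = "He left. 'Why?' she asked. See 1.stackoverflow real time solution. Done."
result = break_sentences(sample)
print(result)
```

The period in 1.stackoverflow is left alone simply because no whitespace follows it; nothing in the pattern has to exclude it explicitly.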

This assumes the regex you had was already splitting on a period followed by whitespace followed by a capital, as you stated. Note, however, that a split consumes everything the pattern matches, so "I am Sam. Sam I am." would split into "I am Sam" and "am I am". Is that really what you want? If not, use zero-width assertions to exclude the parts you want to match but also keep. Here are your options, in order of what I think you most likely want.
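The lost letter is easy to reproduce. The following sketch uses Python's re.split with the pattern above; the sample string is an assumption chosen to include the 1.stackoverflow case from the question.

```python
import re

sample = 'I am Sam. Sam I am. See 1.stackoverflow real time solution.'

# Everything the pattern matches is consumed by the split,
# including the first letter of the following sentence.
parts = re.split(r'\.\s+[A-Z"\']', sample)
print(parts)
# ['I am Sam', 'am I am', 'ee 1.stackoverflow real time solution.']
```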

1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:

(?<=\.)\s+(?=[A-Z"\'])

This will split the example above into "I am Sam." and "Sam I am."
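A short sketch of option 1 in Python, assuming the built-in re module. The function name split_sentences and the sample text are illustrative only.

```python
import re

# Lookbehind keeps the period; lookahead keeps the next capital or quote.
# Only the whitespace between them is consumed.
SENTENCE_GAP = re.compile(r'(?<=\.)\s+(?=[A-Z"\'])')

def split_sentences(text):
    """Split on the whitespace between sentences (hypothetical helper)."""
    return SENTENCE_GAP.split(text)

sample = 'I am Sam. Sam I am. "Really?" he said. See 1.stackoverflow now.'
parts = split_sentences(sample)
print(parts)
# ['I am Sam.', 'Sam I am.', '"Really?" he said.', 'See 1.stackoverflow now.']
```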

2) Keep the first letter of the next sentence; lose the period and whitespace:

\.\s+(?=[A-Z"\'])

This will split into "I am Sam" and "Sam I am", presuming more sentences follow. Otherwise the final period stays with the last sentence, because it is not followed by whitespace and a capital letter or quote. If this is the option you want (sentences without their periods), you may also want to match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace are dropped as well:

\.(?:\s+(?=[A-Z"\'])|\s*$)

Note the ?:. You need non-capturing parentheses, because if a split pattern contains capture groups, anything captured by a group is added as an element in the results (e.g. split('(\+)', 'a+b+c') gives you an array of a + b + c rather than just a b c).
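Both points can be checked with a small Python sketch. The sample string and variable names are assumptions for illustration. One detail specific to Python's re.split is that a match at the very end of the string leaves an empty final element, so the sketch filters it out.

```python
import re

sample = 'I am Sam. Sam I am.'

# Option 2 as first written: the last period survives.
basic = re.split(r'\.\s+(?=[A-Z"\'])', sample)

# Extended pattern: also match a period at the end of the string.
raw = re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', sample)
extended = [part for part in raw if part]  # drop the trailing empty string

# Capture groups leak the separators into the result; non-capturing groups do not.
with_capture = re.split(r'(\+)', 'a+b+c')
without_capture = re.split(r'(?:\+)', 'a+b+c')

print(basic)            # ['I am Sam', 'Sam I am.']
print(extended)         # ['I am Sam', 'Sam I am']
print(with_capture)     # ['a', '+', 'b', '+', 'c']
print(without_capture)  # ['a', 'b', 'c']
```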

3) Keep everything; whitespace goes with the preceding sentence:

(?<=\.\s+)(?=[A-Z"\'])

This will give you "I am Sam. " (with its trailing whitespace) and "Sam I am."
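One caveat when applying option 3 in Python: the built-in re module only accepts fixed-width lookbehind, so a lookbehind containing \s+ is rejected there. A possible workaround, sketched below under that assumption, is to capture the whitespace during the split and glue it back onto the preceding sentence. The function name split_keep_whitespace is hypothetical.

```python
import re

def split_keep_whitespace(text):
    """Split into sentences, leaving inter-sentence whitespace on the earlier one."""
    # The capture group makes re.split return the whitespace runs as well,
    # so the list alternates: sentence, gap, sentence, gap, ..., sentence.
    pieces = re.split(r'(?<=\.)(\s+)(?=[A-Z"\'])', text)
    sentences = pieces[0::2]
    gaps = pieces[1::2] + ['']  # the last sentence has no gap after it
    return [sentence + gap for sentence, gap in zip(sentences, gaps)]

sample = 'I am Sam.  Sam I am.'
parts = split_keep_whitespace(sample)
print(parts)
# ['I am Sam.  ', 'Sam I am.']
```

Because nothing is discarded, joining the pieces with an empty string reproduces the original text.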

Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html. Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html
