Regex match characters when not preceded by a string
Problem description
I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3, and I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. case first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to get it to check for just the string No before the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
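To make the concern about independent character checks concrete, here is a minimal sketch (not part of the original post) that compares the last two conditions of the regex above with a single negative lookbehind for the whole string "No.". The sample sentence and the variable names are made up for illustration. Python's re module requires lookbehinds to be fixed-width, which (?<!No\.) is.

```python
import re

# Independent single-character checks, as in the last two conditions above.
per_char = re.compile(r'(?<=[.?!])(?<=[^N]..)(?<=[^o].)')
# One negative lookbehind that rejects the whole string "No." at once.
whole = re.compile(r'(?<=[.?!])(?<!No\.)')

# Hypothetical sample: "go." ends a sentence, "No." is an abbreviation.
text = 'Let us go. No. 3 is ready.'

# Both patterns are zero-width, so collect the positions where they match.
per_char_hits = [m.start() for m in per_char.finditer(text)]
whole_hits = [m.start() for m in whole.finditer(text)]

print(per_char_hits)  # [26]
print(whole_hits)     # [10, 26]
```

The per-character version skips the position after "go." only because the word happens to end in "o", while the whole-string lookbehind rejects only the position after "No.".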
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings that I expect to contain a period, so that it does not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
Recommended answer
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
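I would do it in three steps:
- replace the spaces that should be kept with some special character (re.sub)
- split the text (re.split)
- replace the special character with a space
For example: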
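import re
from pprint import pprint

# Placeholder for spaces that must survive the split; \s does not match it.
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
# Step 1: protect spaces after a period followed by a digit or lowercase letter,
# after a comma, and after "Sgt.".
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
# Step 2: split on whitespace that follows sentence-ending punctuation.
s = re.split(r'(?<=[.?!])\s+', s)
# Step 3: restore the protected spaces.
pprint([line.replace(zero_width_space, ' ') for line in s])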
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']
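Since the question mentions handling other "problem" strings the same way as Sgt., the three steps could be wrapped in a function that takes a list of abbreviations. This is only a sketch under assumptions: the name split_sentences, the abbreviations parameter and the 'Dr.' example are hypothetical and not part of the answer, and the placeholder character is assumed not to occur in the input text.

```python
import re

ZWSP = '\u200b'  # placeholder; assumed not to appear in the input text


def split_sentences(text, abbreviations=('Sgt.',)):
    """Split text into sentences, keeping listed abbreviations intact."""
    # Spaces to protect: after a period followed by a digit or lowercase
    # letter, and after a comma (the same rules as in the answer).
    alternatives = [r'(?<=\.)\s+(?=[\da-z])', r'(?<=,)\s+']
    # One fixed-width lookbehind per abbreviation, escaped literally.
    alternatives += [r'(?<={})\s+'.format(re.escape(a)) for a in abbreviations]
    protect = '|'.join(alternatives)

    marked = re.sub(protect, ZWSP, text)          # step 1: protect spaces
    parts = re.split(r'(?<=[.?!])\s+', marked)    # step 2: split
    return [p.replace(ZWSP, ' ') for p in parts]  # step 3: restore spaces


sample = ('I am from New York, N.Y. and I would like to say hello! '
          'How are you today? I am well. I owe you $6. 00 because you '
          'bought me a No. 3 burger. -Sgt. Smith')
result = split_sentences(sample)
doctor = split_sentences('I met Dr. Jones. He was kind.', abbreviations=('Dr.',))
```

With the default list, the function should reproduce the output shown above for the sample text. Each abbreviation gets its own lookbehind because Python's re module only accepts fixed-width lookbehind patterns, so abbreviations of different lengths cannot share a single one.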