Alternatives to variable-width lookbehind in Python regex

Problem description

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:

oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")

In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:

import re
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)

Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.

Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

Recommended answer

In the case you described, you need to use a capture group instead of the lookbehind:

"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"

becomes

r"ORIG\s?:\s?/\s?([A-Z0-9]+)"

The value will be in .group(1). Note that raw strings are preferred in Python, since they avoid doubling every backslash in the pattern.

Here is some sample code:

import re

# Match the (possibly space-separated) prefix, then capture the value after it.
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(p.search(test_str).group(1))  # prints: texthere
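
As a rough sketch of how the same pattern might be applied across several OCR'd lines: the helper below returns None when a line has no match, since calling .group(1) directly on a failed search raises AttributeError. The function name extract_account and the sample lines are made up for illustration; they are not from the original post.

```python
import re
from typing import Optional

# Same pattern as in the answer: consume the prefix, capture only the value.
ORIG_PATTERN = re.compile(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", re.IGNORECASE)


def extract_account(line: str) -> Optional[str]:
    """Return the value after the ORIG:/ prefix, or None if absent (hypothetical helper)."""
    match = ORIG_PATTERN.search(line)
    return match.group(1) if match else None


# Invented sample lines imitating OCR output with stray spaces.
sample_lines = ["ORIG:/ABC123", "ORIG : / 98XZ7 more text", "nothing to see"]
for line in sample_lines:
    print(repr(line), "->", extract_account(line))
```

Compiling the pattern once and reusing it keeps the per-line work to a single search, so no separate preprocessing pass over the text is needed.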

Unless you need overlapping matches, using a capturing group instead of a lookbehind is fairly straightforward.
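
To make the overlapping-match caveat concrete: a capture group consumes its prefix, so a character that was part of one match cannot serve as the prefix of the next, whereas a lookbehind would leave it available. One common workaround, which the answer above does not mention, is to wrap the whole pattern in a lookahead so that nothing is consumed. The toy pattern and input below are assumptions chosen only to show the difference.

```python
import re

text = "1 2 3"

# Capture group: the match "1 2" consumes the 2, so it cannot act as the prefix for 3.
consuming = re.findall(r"\d\s?(\d)", text)

# Lookahead wrapper: each match is zero-width, so every digit can still act as a prefix.
overlapping = re.findall(r"(?=\d\s?(\d))", text)

print(consuming)    # ['2']
print(overlapping)  # ['2', '3']
```

With the lookahead form the overall match is empty, so the captured text is only available through the group, exactly as with the plain capture-group version.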
