python regex: capture parts of multiple strings that contain spaces

Problem description

I am trying to capture sub-strings from a string that looks similar to

'some string, another string, '

I want the result match group to be

('some string', 'another string')

My current solution

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

works, but is not practicable. What I am showing here is, of course, massively reduced in complexity compared to what I'm doing in the real project; I want to use one 'straight' (non-computed) regex pattern only. Unfortunately, my attempts have failed so far:

This doesn't match (None as result), because {2} is applied to the space only, not to the whole string:

>>> match('.*?, {2}', 'some string, another string, ')
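
To illustrate what the quantifier binds to in that attempt: `{2}` attaches only to the single preceding atom, the space, so the pattern asks for a comma followed by two spaces. A minimal sketch; the second input string is made up for illustration and is not part of the real data:

```python
import re

pattern = r'.*?, {2}'  # {2} repeats only the space directly before it

# The original input has one space after each comma, so nothing matches.
no_match = re.match(pattern, 'some string, another string, ')

# Hypothetical input with two spaces after the first comma: this one matches.
two_spaces = re.match(pattern, 'some string,  another string, ')

print(no_match)                                     # None
print(repr(two_spaces.group(0)) if two_spaces else None)  # 'some string,  '
```

To repeat more than one character, the repeated part has to be wrapped in a group, which is what the next attempts do.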

Adding parentheses around the repeated string leaves the comma and space in the result:

>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)

Adding another set of parentheses does fix that, but gets me too much:

>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')

Adding a non-capturing modifier improves the result, but still misses the first string:

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
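
One way to read the results above: the length of the tuple returned by `groups()` is fixed by the number of capturing parentheses written in the pattern, not by how often a quantifier repeats them. A small sketch that counts the groups of each attempt using the compiled pattern's `groups` attribute (the dictionary labels are only for illustration):

```python
import re

# The attempts from above, plus the computed pattern.
patterns = {
    'one repeated group': r'(.*?, ){2}',
    'nested groups': r'((.*?), ){2}',
    'non-capturing wrapper': r'(?:(.*?), ){2}',
    'computed pattern': 2 * '(.*?), ',
}

# Pattern.groups is the number of capturing groups in the pattern.
group_counts = {name: re.compile(p).groups for name, p in patterns.items()}

for name, count in group_counts.items():
    print(f'{name}: {count} group(s)')
```

Only the computed pattern contains two separate groups that each capture one item, which is why it is the only one returning both strings.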

I feel like I'm close, but I can't really seem to find the proper way.

Can anyone help me? Are there any other approaches I'm not seeing?


Update after the first few responses:

First up, thank you very much everyone, your help is greatly appreciated! :-)

As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large amounts of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There is also XML, JSON, binary and some other data file formats, but let's stay focussed.

In order to cope with the multitude of file formats and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after the other, applies a regex to every line and returns a large data structure with the matches. This module is a prototype; the production version will have to be written in C++ for performance reasons, will be connected via Boost::Python, and will probably add the subject of regex dialects to the list of complexities.

Also, there are not 2 repetitions, but a number that currently varies between zero and 70 (or so), the comma is not always a comma and, despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let's just say I have reason to try to reduce the 'dynamic' amount and keep as much of the pattern 'fixed' as possible.

So, in a word: I must use regular expressions.


Attempt to rephrase: I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture

'some string, another string, '

into

('some string', 'another string')

?

Hmmm, that probably narrows it down too far - but then, any way you do it is wrong :-D


Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there's gotta be 2 of something), but only return 1 string (the second one)?

The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:

>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)

Also, it's not the second string that's returned, it is the last one:

>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)
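
For comparison, a sketch of one way to get every repetition with the standard `re` module: instead of repeating a group inside a single match, `re.findall` applies the unrepeated pattern over and over and returns each capture. This is an illustration only; unlike `match`, it does not check that the line as a whole has the expected shape.

```python
import re

line = 'some string, another string, third string, '

# A repeated group keeps only the capture of its final iteration.
last_only = re.match(r'(?:(.*?), )+', line).groups()

# findall restarts the unrepeated pattern after each hit and collects every capture.
all_items = tuple(re.findall(r'(.*?), ', line))

print(last_only)  # ('third string',)
print(all_items)  # ('some string', 'another string', 'third string')
```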

Again, thanks for your help; it never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...

Solution

In order to sum this up, it seems I am already using the best solution by constructing the regex pattern in a 'dynamic' manner:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

The multiplication

2 * '(.*?), '

is what I mean by dynamic. The alternative approach

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

fails to return the desired result because (as Glenn and Alan kindly explained):

with match, the captured content gets overwritten with each repetition of the capturing group
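
Given that behaviour, the computed pattern can at least be kept in one place. The following is a minimal sketch under assumptions, not the code from the real project: `build_pattern` and its parameters are hypothetical names, and `re.escape` is added so that separators other than a comma (for example `|`) are treated literally.

```python
import re

def build_pattern(count, separator=', '):
    """Return a regex with one lazy capturing group per expected field.

    Hypothetical helper: count is the number of fields on the line,
    separator is the literal text that follows each field.
    """
    # re.escape keeps separators that are regex metacharacters literal.
    return count * ('(.*?)' + re.escape(separator))

fields = re.match(build_pattern(2), 'some string, another string, ').groups()
piped = re.match(build_pattern(3, '|'), 'a|b c|d|').groups()

print(fields)  # ('some string', 'another string')
print(piped)   # ('a', 'b c', 'd')
```

The only computed parts are the repetition count and the separator; the shape of each field stays fixed in the helper.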

Thanks for your help everyone! :-)
