Python re.finditer match.groups() does not contain all groups from match


Problem description

I am trying to use regex in Python to find and print all matching lines from a multiline search. The text that I am searching through may have the below example structure:

AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA

From this I want to retrieve the ABC* lines that occur at least once and are preceded by an AAA.

The problem is, that despite the group catching what I want:

match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>

... I can access only the last match of the group:

match groups = ('AAA\n', 'ABC4\n')

Below is the example code that I use for this problem.

#! python
import sys
import re
import os

string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)

p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #   
matches = re.finditer(p_MATCHES[0],string)

for match in matches:
    strout = ''
    gr_iter=0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter+=1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout+= '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")

Solution

Here is your regular expression:

(AAA\r\n)(ABC[0-9]\r\n){1,}

Debuggex Demo

Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part

ABC[0-9]\r\n

is being captured (is inside the parentheses), and its quantifier,

{1,}

is not being captured, each repetition overwrites the previous one in the group, so every match except the final one is discarded. To get them all, you must also capture the quantifier:

AAA\r\n((?:ABC[0-9]\r\n){1,})

Debuggex Demo

I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)

The captured text can be split on the newline, and will give you all the pieces as you wish.

(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
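Translated back to Python, the same idea might look like the sketch below. It assumes the input uses plain \n line endings, as in the question's string, so \r\n is replaced with \n; the names text, pattern and results are illustrative only.

```python
import re

# Sample input from the question (plain "\n" line endings assumed).
text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# The outer group captures the whole repetition; the repeated part
# itself sits in a non-capturing group, so nothing gets overwritten.
pattern = re.compile(r"AAA\n((?:ABC[0-9]\n){1,})")

# Split each captured block on newlines to get the individual ABC# lines.
results = [m.group(1).splitlines() for m in pattern.finditer(text)]

for lines in results:
    print(lines)
```

Each entry in results holds the ABC# lines that followed one AAA. The trailing AAA with no ABC# after it produces no match at all.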


This is a workaround. Few regular expression flavors can iterate through repeated captures (which ones...?). A more common approach is to loop through the input and process each match as it is found. Here's an example in Java:

import java.util.regex.*;

public class RepeatingCaptureGroupsDemo {
   public static void main(String[] args) {
      String input = "I have a cat, but I like my dog better.";

      Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
      Matcher m = p.matcher(input);

      while (m.find()) {
         System.out.println(m.group());
      }
   }
}

Output:

cat
dog

(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a quarter of the way down)
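The same loop-and-process approach can be sketched in Python for the question's data. This is one possible two-pass arrangement, not the only one: an outer re.finditer locates each AAA block, and an inner re.finditer walks the ABC# entries inside it. The names block_re, item_re and found are hypothetical, and \n line endings are assumed.

```python
import re

text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# Pass 1: find each AAA header followed by one or more ABC# lines.
block_re = re.compile(r"AAA\n(?:ABC[0-9]\n)+")
# Pass 2: pick out the individual ABC# entries inside a block.
item_re = re.compile(r"ABC[0-9]")

found = []
for block in block_re.finditer(text):
    for item in item_re.finditer(block.group()):
        found.append(item.group())
        print(item.group())
```

Because every ABC# is its own match in the inner loop, no repeated capture group is involved and nothing is lost.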


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.
