python - re.findall: how to separate content into groups

Problem description

I need some clarification on how a regex works with the re.findall method.

pattern = re.compile(r'(?<=\\\\\[-16pt]\n)([\s\S]*?)(?=\\\\\n\\thinhline)')
content= ' '.join(re.findall(pattern, content))

The code above prints all the content the pattern matches, where each match starts after \\[-16pt] and ends before \\, a newline, and \thinhline, but it also prints all the text after that. Suppose I have the following content for the pattern to match:

\\[-16pt]
x = 10
print ("hi")
\\
\thinhline
\\[-16pt]
y = 3
print ("bye")
\\
\thinhline
\\[-16pt]
z = 7
print ("zap")
\\
\thinhline
This is random text.
All of this is matched by re.findall, even though it is not included within the pattern.
xyz = "xyz"

How would I separate out each group so that I can edit them independently? For example:

Group 1:

x = 10
print ("hi")

Group 2:

y = 3
print ("bye")

Group 3:

z = 7
print ("zap")

and none of the extra text that is matched after it?

Thanks.

Recommended answer

Consider the following runnable program:

import re

content="""\\[-16pt]
x = 10
print ("hi")
\\
thinhline
\\[-16pt]
y = 3
print ("bye")
\\
thinhline
\\[-16pt]
z = 7
print ("zap")
\\
thinhline
This is random text.
"""

pattern = re.compile(r"""(\\\[-16pt]\n)    # Start. Don't technically need to capture.
                         (.*?)             # What we want. Must capture ;)
                         (\n\\\nthinhline) # End. Also don't really need to capture
                      """, re.X | re.DOTALL)

for m in pattern.finditer(content):
    print("Matched:\n----\n%s\n----\n" % m.group(2))

When run, it outputs:

Matched:
----
x = 10
print ("hi")
----

Matched:
----
y = 3
print ("bye")
----

Matched:
----
z = 7
print ("zap")
----

Notes:

  • With the re.X option, the expression can span multiple lines and carry comments.
  • With the re.DOTALL option, "." also matches newlines, so the [\s\S] workaround and some of the extra backslashes can be dropped, and the ".*?" group (i.e. "get every character non-greedily up until the next match") will include newlines.
  • I used finditer rather than findall, which technically moves away from your question, but you wanted to work with each match, so I figured it was a good approach.
  • I took the tab \t off the thinhline because I wasn't sure whether it was meant to be a tab character or a backslash followed by t. Not that it affects the above much, but I wanted to be clear.
  • I capture the start and end groups only for demonstration. Only the middle group is really needed.
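
Since only the middle group is really needed, the same idea can also be sketched with re.findall itself: when a pattern contains exactly one capturing group, findall returns a list of that group's strings, one per match. The snippet below is a minimal sketch rather than part of the original answer. It assumes the same simplified end marker as the program above (a lone backslash line followed by "thinhline", no tab), and the names lines and groups are illustrative.

```python
import re

# Sample text rebuilt line by line to avoid escaping surprises.
# Assumption: the end marker is a lone backslash line followed by
# "thinhline" (no tab), as in the program above.
lines = [
    "\\[-16pt]", "x = 10", 'print ("hi")', "\\", "thinhline",
    "\\[-16pt]", "y = 3", 'print ("bye")', "\\", "thinhline",
    "\\[-16pt]", "z = 7", 'print ("zap")', "\\", "thinhline",
    "This is random text.",
]
content = "\n".join(lines) + "\n"

# Only the middle part is captured, so findall returns it directly.
pattern = re.compile(r"\\\[-16pt\]\n(.*?)\n\\\nthinhline", re.DOTALL)

groups = pattern.findall(content)  # list of str, one entry per block

# Each entry can now be edited on its own.
groups[0] = groups[0].replace("10", "11")

for number, body in enumerate(groups, start=1):
    print("Group %d:\n%s\n" % (number, body))
```

Keeping the result as a list, instead of joining it into one string, is what allows each group to be changed independently. The trailing "This is random text." line never appears in the list because it lies outside every start/end pair.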
