python - re.findall: how to separate content into groups

Views: 58  Published: 2021/6/26 20:11:53  Tags: python regex python-2.7

Question

I need some clarification on how regex with the re.findall method works.

pattern = re.compile(r'(?<=\\\\\[-16pt]\n)([\s\S]*?)(?=\\\\\n\\thinhline)')
content= ' '.join(re.findall(pattern, content))

So the above prints all the content that the pattern matches, with the start being \\[-16pt] and the end being \\ followed by a newline and \thinhline, plus all the text after it. Suppose I had the following content matched by the pattern:

\\[-16pt]
x = 10
print ("hi")
\\
\thinhline
\\[-16pt]
y = 3
print ("bye")
\\
\thinhline
\\[-16pt]
z = 7
print ("zap")
\\
\thinhline
This is random text.
All of this is matched by re.findall, even though it is not included within the pattern.
xyz = "xyz"

How would I separate out each group so that I could edit them independently? For example:

Group 1:

x = 10
print ("hi")

Group 2:

y = 3
print ("bye")

Group 3:

z = 7
print ("zap")

and none of the extra stuff that is matched after it?

Thanks.

Recommended answer

Consider the following runnable program:

import re

content="""\\[-16pt]
x = 10
print ("hi")
\\
thinhline
\\[-16pt]
y = 3
print ("bye")
\\
thinhline
\\[-16pt]
z = 7
print ("zap")
\\
thinhline
This is random text.
"""

pattern = re.compile(r"""(\\\[-16pt]\n)    # Start. Don't technically need to capture.
                         (.*?)             # What we want. Must capture ;)
                         (\n\\\nthinhline) # End. Also don't really need to capture
                      """, re.X | re.DOTALL)

for m in pattern.finditer(content):
    print("Matched:\n----\n%s\n----\n" % m.group(2))

Output when run:

Matched:
----
x = 10
print ("hi")
----

Matched:
----
y = 3
print ("bye")
----

Matched:
----
z = 7
print ("zap")
----
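One detail that may be worth checking in the program above is how many backslashes the sample text really contains. Because content is a regular (non-raw) triple-quoted string, Python turns each \\ into a single backslash before the regex sees it, which is presumably why the pattern here needs fewer backslashes than the one in the question. The following is only a small sketch of that difference; the variable names are made up for illustration:

```python
import re

# Regular string literal: "\\" is an escape sequence for ONE backslash.
plain = "\\[-16pt]\n"
# Raw string literal: both backslashes are kept, so the text has TWO.
raw = r"\\[-16pt]" "\n"

# In a regex, each literal backslash must itself be written as \\.
one = re.compile(r"\\\[-16pt]\n")    # expects one literal backslash
two = re.compile(r"\\\\\[-16pt]\n")  # expects two literal backslashes

one_matches_plain = one.match(plain) is not None
two_matches_raw = two.match(raw) is not None
two_matches_plain = two.match(plain) is not None

print(len(plain), len(raw))
print(one_matches_plain, two_matches_raw, two_matches_plain)
```

If the real input comes from a file containing two backslashes before [-16pt], the two-backslash form of the pattern would be the one to use.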

Notes:

By using the re.X option the expression can be multiline and commented
By using the re.DOTALL option the excessive backslashes can be dropped and the ".*?" group (i.e. "get every character non-greedily up until the next match") will include newlines.
I used finditer rather than findall ... which technically moves away from your question, but you wanted to work with each match so I figured it was a good approach.
I took the \t off the thinhline because I wasn't sure whether it was meant to be a tab character or a backslash followed by t. Not that it affects the above much, but I just wanted to be clear.
I capture the start and end groups only for demonstration. Only the middle group is really needed.
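For anyone who would rather stay with re.findall, as in the original question, here is a minimal sketch of the same idea. It relies on the documented behaviour that findall returns just the captured text when the pattern has exactly one capturing group, so the delimiters are made non-capturing with (?:...). The sample text is written as a raw string with two backslashes and \thinhline, which is an assumption about the real input, and the replace call is only a hypothetical stand-in for whatever per-group edit is needed:

```python
import re

# Assumed input: a raw string, so every backslash is literal.
content = r"""\\[-16pt]
x = 10
print ("hi")
\\
\thinhline
\\[-16pt]
y = 3
print ("bye")
\\
\thinhline
This is random text.
"""

pattern = re.compile(
    r"""(?:\\\\\[-16pt\]\n)      # start delimiter, non-capturing
        (.*?)                    # the block we want: the only capture group
        (?:\n\\\\\n\\thinhline)  # end delimiter, non-capturing
    """,
    re.X | re.DOTALL,
)

# With a single capturing group, findall returns a list of strings,
# one per match, containing only the captured block.
groups = pattern.findall(content)

# Each block can now be edited independently (hypothetical edit).
edited = [g.replace("print", "log") for g in groups]

for number, block in enumerate(groups, start=1):
    print("Group %d:\n%s\n" % (number, block))
```

Since the trailing text is never enclosed by a start and end delimiter, it does not show up in groups.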
