python - re.findall: how to separate content into groups
Question
I need some clarification on how a regex works with the re.findall method.
pattern = re.compile(r'(?<=\\\\\[-16pt]\n)([\s\S]*?)(?=\\\\\n\\thinhline)')
content= ' '.join(re.findall(pattern, content))
So the above prints all the content that the pattern matches, with the start being \\[-16pt] and the end being \\ followed by a newline and \thinhline, plus all the text after it.
If I had the following content matched by the pattern:
\\[-16pt]
x = 10
print ("hi")
\\
\thinhline
\\[-16pt]
y = 3
print ("bye")
\\
\thinhline
\\[-16pt]
z = 7
print ("zap")
\\
\thinhline
This is random text.
All of this is matched by re.findall, even though it is not included within the pattern.
xyz = "xyz"
How would I separate out each group so that I could have, for example, the following, and be able to edit each one independently:
Group 1:
x = 10
print ("hi")
Group 2:
y = 3
print ("bye")
Group 3:
z = 7
print ("zap")
and none of the extra stuff that is matched after it?
Thanks.
Recommended answer
Consider the following runnable program:
import re
content="""\\[-16pt]
x = 10
print ("hi")
\\
thinhline
\\[-16pt]
y = 3
print ("bye")
\\
thinhline
\\[-16pt]
z = 7
print ("zap")
\\
thinhline
This is random text.
"""
pattern = re.compile(r"""(\\\[-16pt]\n) # Start. Don't technically need to capture.
(.*?) # What we want. Must capture ;)
(\n\\\nthinhline) # End. Also don't really need to capture
""", re.X | re.DOTALL)
for m in pattern.finditer(content):
    print("Matched:\n----\n%s\n----\n" % m.group(2))
When run, this outputs:
Matched:
----
x = 10
print ("hi")
----
Matched:
----
y = 3
print ("bye")
----
Matched:
----
z = 7
print ("zap")
----
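The loop above only prints each match. Since the question asks about editing the groups independently, one possible follow-up is to collect the captured text in a list first. This is a minimal sketch, assuming that "editing" means ordinary string manipulation; the names sample and blocks are illustrative and do not come from the original answer.

```python
import re

# Shortened sample input, written the same way as in the answer:
# in a normal (non-raw) string literal, "\\" is one literal backslash.
sample = """\\[-16pt]
x = 10
print ("hi")
\\
thinhline
\\[-16pt]
y = 3
print ("bye")
\\
thinhline
This is random text.
"""

# Only the text between the delimiters is captured (group 1).
pattern = re.compile(r"\\\[-16pt\]\n(.*?)\n\\\nthinhline", re.DOTALL)

# One list entry per match, so each block can be changed on its own.
blocks = [m.group(1) for m in pattern.finditer(sample)]

# Hypothetical edit: change only the first block.
blocks[0] = blocks[0].replace("10", "42")

for number, block in enumerate(blocks, start=1):
    print("Group %d:\n%s\n" % (number, block))
```

The trailing "This is random text." line never ends up in blocks, because it is not followed by the closing delimiter.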
Notes:
- By using the re.X option, the expression can be multiline and commented.
- By using the re.DOTALL option, the excessive backslashes can be dropped and the ".*?" group (i.e. "get every character non-greedily up until the next match") will include newlines.
- I used finditer rather than findall, which technically moves away from your question, but you wanted to work with each match so I figured it was a good approach.
- I took the tab \t off the thinhline because I wasn't sure if it was meant to be a tab character or a backslash-then-t. Not that it affects the above much, but I just wanted to be clear.
- I capture the start and end groups only for demonstration. Only the middle group is really needed.
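To tie the notes back to the original question, here is a hedged sketch of a findall variant. It relies on the documented behaviour that re.findall returns a list of strings when the pattern has exactly one capturing group, so the delimiters are wrapped in non-capturing (?:...) groups. The second part illustrates the tab versus backslash-then-t ambiguity mentioned above. The variable names groups, as_tab and as_backslash_t are illustrative only.

```python
import re

content = """\\[-16pt]
x = 10
print ("hi")
\\
thinhline
\\[-16pt]
y = 3
print ("bye")
\\
thinhline
This is random text.
"""

# (?:...) matches the delimiters without capturing them,
# leaving the middle part as the only capturing group.
pattern = re.compile(r"(?:\\\[-16pt\]\n)(.*?)(?:\n\\\nthinhline)", re.DOTALL)

# With a single capturing group, findall returns that group for each match.
groups = pattern.findall(content)
for number, group in enumerate(groups, start=1):
    print("Group %d:\n%s\n" % (number, group))

# In a normal string literal "\t" is one tab character;
# in a raw string it stays as a backslash followed by "t".
as_tab = "\thinhline"
as_backslash_t = r"\thinhline"
print(repr(as_tab), repr(as_backslash_t))
```

If the real input contains a literal backslash before thinhline, the closing part of the pattern would need an extra escaped backslash; that depends on the actual data, which the question does not make fully clear.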