Extract specific text using multiple regex in Python?


Question

I have a problem using regular expressions in Python 3, so I would be grateful if someone could help me. I have a text file like the one below:

Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end

What I would like is a list of the text between the headers, including the headers themselves. I am using this regular expression:

 re.findall(r'(?=(Header.*?Header|Header.*?end))',data, re.DOTALL)

The result is:

['Header A\ntext text\n text text\n Header', 'Header B\ntext text\n text text\n Header', 'Header C\n text text here is the end']

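The problem is that every item in the list ends with the next header. As you can see, each section ends where the next header begins, but the last section does not end in any specific way.

Is there a way to get a list (not tuples) of every header together with its own text, as substrings, using regular expressions?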

Solution

Header [^\n]*[\s\S]*?(?=Header|$)

Try this. See the demo:

https://regex101.com/r/iS6jF6/21

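import re

# Match a header line, then lazily consume text up to (not including) the next "Header" or the end of the string.
p = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')
test_str = "Header A\ntext text\ntext text\nHeader B\ntext text\ntext text\nHeader C\ntext text\nhere is the end"

# The pattern has no capture groups, so findall returns a list of strings.
print(p.findall(test_str))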
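
One caveat with the pattern above: the lookahead stops at any occurrence of the word Header, even in the middle of a body line. A minimal sketch of a stricter variant follows, assuming headers always start at the beginning of a line. The names SECTION_RE and split_sections and the sample data are made up for illustration and are not part of the original answer.

```python
import re

# Assumption: a header is a line that starts with "Header ".
# re.MULTILINE lets ^ match at the start of every line; \Z matches only at the end of the string.
SECTION_RE = re.compile(r'^Header [^\n]*[\s\S]*?(?=^Header |\Z)', re.MULTILINE)


def split_sections(text):
    """Return each header with its body as one string, without trailing newlines."""
    return [section.rstrip('\n') for section in SECTION_RE.findall(text)]


# Illustrative input: the second section mentions "Header" inside its body.
data = (
    "Header A\ntext text\ntext text\n"
    "Header B\ntext with the word Header inside\n"
    "Header C\ntext text\nhere is the end"
)

for section in split_sections(data):
    print(repr(section))
```

Because the lookahead only consumes nothing and the pattern has no capture groups, findall still returns plain strings rather than tuples. The rstrip call is optional; it only drops the newline left before the next header.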
