Using re to capture text between key words over the course of a doc

Problem description

I am trying to capture the text between key words in a document, along with the key words themselves.

For example, let's say I have multiple instances of "egg" in a string. I want to capture each word between one "egg" and the next "egg".

I have tried:

import re
text = "egg hashbrowns egg bacon egg fried milk egg"
re.findall(r"(/egg) (.*) (/egg)", text)

I have also tried re.match and re.search.

What I usually get is ("egg"), ("hashbrowns egg bacon egg fried milk"), ("egg").

What I need is (egg, hashbrowns, egg), (egg, bacon, egg), (egg, fried, milk, egg).

Any help would be appreciated.

Recommended answer

You need a non-greedy match. *? is the non-greedy form of *, and matches the smallest possible sequence. Also, /egg matches the literal text "/egg", but I assume you just want egg, so your regex becomes (egg) (.*?) (egg). However, since a regular expression consumes the string as it matches, an "egg" that closes one match is no longer available to open the next, so you need look-ahead and look-behind assertions to match the intermediate text. In this case, (?<=egg) (.*?) (?=egg) finds text with "egg" before and after but returns only the in-between parts, i.e. ['hashbrowns', 'bacon', 'fried milk']. Capturing the "egg" delimiters as well takes a little more work, because each inner "egg" has to close one match and open the next, so it's only worth going into if that's actually what you want.
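To make the difference concrete, here is a minimal sketch that runs each pattern from this answer against the string from the question. The last two steps are not part of the original answer: they show one possible way to keep the "egg" delimiters in a single pass, by consuming only the opening "egg" and capturing the closing one inside a look-ahead, and then splitting the middle text into words. The variable names are illustrative only.

```python
import re

text = "egg hashbrowns egg bacon egg fried milk egg"

# Greedy: .* runs to the last "egg", so everything lands in one match.
greedy = re.findall(r"(egg) (.*) (egg)", text)

# Non-greedy, but both delimiters are consumed: the "egg" that closes the
# first match cannot open the next one, so "bacon" is skipped.
lazy = re.findall(r"(egg) (.*?) (egg)", text)

# Look-behind / look-ahead: the delimiters are checked but not consumed.
between = re.findall(r"(?<=egg) (.*?) (?=egg)", text)

# Assumption (not from the answer): consume only the opening "egg" and
# capture the closing one inside a look-ahead, so it can be reused.
with_keys = re.findall(r"(egg) (.*?) (?=(egg))", text)

# Split the middle text into words to get one flat tuple per match.
flat = [(first, *middle.split(), last) for first, middle, last in with_keys]

print(greedy)     # [('egg', 'hashbrowns egg bacon egg fried milk', 'egg')]
print(lazy)       # [('egg', 'hashbrowns', 'egg'), ('egg', 'fried milk', 'egg')]
print(between)    # ['hashbrowns', 'bacon', 'fried milk']
print(with_keys)  # [('egg', 'hashbrowns', 'egg'), ('egg', 'bacon', 'egg'), ('egg', 'fried milk', 'egg')]
print(flat)       # [('egg', 'hashbrowns', 'egg'), ('egg', 'bacon', 'egg'), ('egg', 'fried', 'milk', 'egg')]
```

Note that these patterns match "egg" anywhere, including inside longer words such as "eggplant"; adding \b word boundaries around egg would be one way to tighten that if it matters for your document.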

All of this is documented in the Python docs for the re module, so look there for more information.
