Extract all substrings between two markers


Question

I have a string:

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

What I want is a list of the substrings between the markers start="&marker1" and end="/\n". Thus, the expected result is:

whatIwant = ["The String that I want", "Another string that I want"]

I read the answers here:

  1. Find string between two substrings [duplicate]
  2. How to extract the substring between two markers?

I then tried this, but without success:

>>> import re
>>> mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> whatIwant = re.search("&marker1(.*)/\n", mystr)
>>> whatIwant.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

What could I do to resolve this? Also, my actual string is very long:

>>> len(myactualstring)
7792818

Recommended answer

I would do:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)
print(found)

Output:

['The String that I want ', 'Another string that I want ']
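Each match above keeps the space that precedes `/`, whereas the expected result in the question has none. A minimal sketch of one way to drop it, assuming whitespace directly before the end marker is never wanted:

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# \s* in front of the end marker absorbs the trailing whitespace, so the
# lazy group (.*?) ends at the last non-whitespace character.
found = re.findall(r"&marker1\n(.*?)\s*/\n", mystr)
print(found)  # ['The String that I want', 'Another string that I want']
```

Calling `.strip()` on each item of the original result would give the same list.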

Note that:

  • & has no special meaning in re patterns, so escaping it (\&) is harmless but not required
  • . matches anything except a newline, which is why the pattern in the question fails: it has no \n after &marker1, so .* cannot get past the line break that follows the marker
  • findall is a better choice than search if you just want a list of the matched substrings
  • *? is non-greedy; in this case .* would work too, because . does not match newlines, but in other cases you might end up matching more than you wish
  • I used a so-called raw string (r-prefixed) to make escaping easier
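The greedy versus non-greedy difference becomes visible once `.` is allowed to match newlines. The sketch below adds `re.DOTALL` purely for illustration; that flag is an assumption and not part of the answer's pattern:

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# With re.DOTALL, "." also matches "\n", so greedy .* runs to the LAST "/\n".
greedy = re.findall(r"&marker1\n(.*)/\n", mystr, flags=re.DOTALL)
# Lazy .*? stops at the FIRST "/\n" after each start marker.
lazy = re.findall(r"&marker1\n(.*?)/\n", mystr, flags=re.DOTALL)

print(len(greedy))  # 1: a single match that swallows the second marker
print(len(lazy))    # 2: one match per marker pair
```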

Read the re module documentation for a discussion of raw-string usage and the list of characters with special meaning.
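For a very long input, such as the 7,792,818-character string in the question, a possible variation is `re.finditer`, which yields matches one at a time instead of building the whole list first. The helper name `iter_between` and its parameters are hypothetical, and whether this is worthwhile depends on how many matches the real data contains:

```python
import re
from typing import Iterator


def iter_between(text: str, start: str, end: str) -> Iterator[str]:
    """Yield each substring found between start and end, lazily."""
    # re.escape makes both markers literal, whatever characters they contain.
    # As in the answer, "." does not match newlines here, so each substring
    # must sit on a single line.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end))
    for match in pattern.finditer(text):
        yield match.group(1)


mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
for item in iter_between(mystr, "&marker1\n", "/\n"):
    print(repr(item))
```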
