Regular Expression to match every new line character (\n) inside a <content> tag

Question

I'm looking for a regular expression to match every newline character (\n) inside an XML <content> tag, or inside any tag nested within that <content> tag. For example:

<blog>
<text>
(Do NOT match new lines here)
</text>
<content>
(DO match new lines here)
<p>
(Do match new lines here)
</p>
</content>
(Do NOT match new lines here)
<content>
(DO match new lines here)
</content>

Recommended answer

Actually... you can't use a simple regex here, at least not a single one. You probably need to worry about comments! Someone may write:

<!-- <content> blah </content> -->

You can take two approaches here:

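  1. Strip out all comments first. Then use the regex approach.
  2. Don't use regular expressions; use a context-sensitive parsing approach that keeps track of whether you are nested inside a comment.

Be careful.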
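
The second approach can be sketched as a small scanner that remembers whether it is currently inside a comment or inside a <content> element. This is only an illustration under some assumptions: the function name newline_offsets_in_content is made up for this example, <content> is assumed to carry no attributes, and CDATA sections are ignored. Newlines inside a comment are skipped even when the comment sits inside <content>.

```python
import re

# The only tokens the scanner reacts to; all other text is skipped over.
TOKEN = re.compile(r"<!--|-->|<content>|</content>|\n")

def newline_offsets_in_content(xml_text):
    """Return offsets of newlines inside <content> but outside comments."""
    offsets = []
    in_comment = False
    depth = 0  # number of <content> elements currently open
    for m in TOKEN.finditer(xml_text):
        tok = m.group()
        if in_comment:
            # Inside a comment, only the terminator changes the state.
            if tok == "-->":
                in_comment = False
        elif tok == "<!--":
            in_comment = True
        elif tok == "<content>":
            depth += 1
        elif tok == "</content>":
            depth = max(depth - 1, 0)
        elif tok == "\n" and depth > 0:
            offsets.append(m.start())
    return offsets

sample = (
    "<text>\na\n</text>\n"
    "<!-- <content>\n</content> -->\n"
    "<content>\nx\n<p>\ny\n</p>\n</content>\n"
)
print(len(newline_offsets_in_content(sample)))  # 5
```

Because the result is a list of offsets into the original string, you can count the newlines, slice around them, or replace them afterwards.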

I am also not so sure you can match all newlines at once. @Quartz suggested this one:

<content>([^\n]*\n+)+</content>

This will match any content element that has a newline character RIGHT BEFORE the closing tag... but I'm not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline characters nested in between. Something more like this:

<content>.*</content>

BUT THERE IS ONE CAVEAT: regex quantifiers are greedy by default, so this regex will match from the first opening tag to the last closing one. Instead, you HAVE to make the quantifier non-greedy. In languages like Python, you can do this with the "?" symbol, as in .*? instead of .* (and note that "." does not match a newline by default, so you also need a flag such as Python's re.DOTALL).
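
To make the caveat concrete, here is a small comparison of the greedy and non-greedy forms in Python. The sample text is invented for the illustration.

```python
import re

text = "<content>a\n</content>\nskip\n<content>b\n</content>"

# Greedy: .* runs to the LAST </content>, swallowing the text in between.
greedy = re.findall(r"<content>.*</content>", text, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </content> it can reach.
lazy = re.findall(r"<content>.*?</content>", text, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

The greedy version returns one match covering the whole string, including the "skip" line that sits outside both elements.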

I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.
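
As a rough idea of what the parsing-library route could look like, here is a sketch using Python's standard xml.etree.ElementTree. The helper name count_content_newlines is made up for this example, and it assumes the document is well-formed XML with a single root element. The default parser drops comments, so a commented-out <content> is never visited.

```python
import xml.etree.ElementTree as ET

def count_content_newlines(xml_text):
    """Return the number of newlines found in each <content> element."""
    root = ET.fromstring(xml_text)
    counts = []
    for content in root.iter("content"):
        # itertext() yields the element's own text plus the text and
        # tails of everything nested inside it, in document order.
        text = "".join(content.itertext())
        counts.append(text.count("\n"))
    return counts

sample = (
    "<blog>\n<text>\nskip\n</text>\n"
    "<content>\nkeep\n<p>\nkeep\n</p>\n</content>\n"
    "<!-- <content>\n</content> -->\n"
    "</blog>"
)
print(count_content_newlines(sample))  # [5]
```

This only sees newlines in text nodes. A newline inside a tag itself (for example between attributes) is not counted, which may or may not be what you want.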

I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right...

Update 1:

Let me summarize a bit more and add some more detail to my response. I am going to use Python's regex syntax because it is what I am more used to (forgive me ahead of time... you may need to escape some characters... comment on my post and I will correct it):

To strip out comments, use this regex: <!--.*?--> Notice the "?" after the .* makes it non-greedy.

Similarly, to search for content tags, use: <content>.*?</content>

Also, you may be able to try this out, and access each newline character with the match object's groups():

<content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it's your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.
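
One reason the last regex does not give you every newline in Python: when a capturing group sits inside a repetition, the match object keeps only the text from the final iteration. The sketch below shows that, followed by the two-step alternative (grab the content body, then search it for newlines). The sample text is invented for the illustration.

```python
import re

text = "<content>\na\nb\n</content>"

# The repeated group matches three times, but only the last one is kept.
m = re.search(r"<content>(.*?(\n))+.*?</content>", text)
print(m.groups())  # ('b\n', '\n')

# Two-step alternative: capture the body, then find each newline in it.
body = re.search(r"<content>(.*?)</content>", text, re.DOTALL)
positions = [body.start(1) + n.start()
             for n in re.finditer(r"\n", body.group(1))]
print(positions)  # [9, 11, 13]
```

The offsets in positions are relative to the original string, so they can be used directly for slicing or replacement.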

Update 2:

So here is Python code that ought to work. I am still unsure what you mean by "find" all newlines. Do you want the entire lines? Or just to count how many newlines there are? To get the actual lines, try:

#!/usr/bin/python

import re

def FindContentNewlines(xml_text):
    # May want to compile these regexes elsewhere, but I do it here for brevity
    comments = re.compile(r"<!--.*?-->", re.DOTALL)
    content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
    # No re.DOTALL here: "." must stop at each newline so every match is one line
    newlines = re.compile(r"^(.*?)$", re.MULTILINE)

    # Strip comments first. XML does not allow nested comments ("--" may not
    # appear inside a comment), so the non-greedy match is enough for
    # well-formed input.
    xml_text = re.sub(comments, "", xml_text)

    result = []
    all_contents = re.findall(content, xml_text)
    for c in all_contents:
        result.extend(re.findall(newlines, c))

    return result

if __name__ == "__main__":
    example = """

<!-- This stuff
ought to be omitted
<content>
  omitted
</content>
-->

This stuff is good
<content>
<p>
  haha!
</p>
</content>

This is not found
"""
    print(FindContentNewlines(example))

This program prints the result:

 ['', '<p>', '  haha!', '</p>', '']

The first and last empty strings come from the newline characters immediately preceding the first <p> and the one coming right after the </p>. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out the intermediate values so you can see what the regexes are matching and not matching.
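
If you only want to count the newlines rather than collect the lines, the same strip-then-search idea gets shorter. This is a sketch only: the function name newline_counts is made up for this example, and it relies on the same assumptions as the code above (plain <content> tags without attributes, comments removed first).

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
CONTENT = re.compile(r"<content>(.*?)</content>", re.DOTALL)

def newline_counts(xml_text):
    """Return the number of newlines inside each <content> element."""
    stripped = COMMENT.sub("", xml_text)
    # findall() returns the captured body of every <content> element.
    return [body.count("\n") for body in CONTENT.findall(stripped)]

example = """
<!-- omitted
<content>
  omitted
</content>
-->
<content>
<p>
  haha!
</p>
</content>
"""
print(newline_counts(example))  # [4]
```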

Hope this helps :-).

PS - I didn't have much luck trying out my regex from my first update to capture all the newlines... let me know if you do.
