Why is it such a bad idea to parse XML with regex?

Question

I was just reviewing a previous post I made and noticed a number of people suggesting that I not use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a 'don't reinvent the wheel' type of issue?

Answer

The real trouble is nested tags. Nested tags are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET and maybe a couple of other flavors. But even with the power of balanced matching, an ill-placed comment could throw off the regular expression.
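To make the nesting problem concrete, here is a minimal sketch in Python (the language, the sample strings, and the two patterns are assumptions chosen for illustration; the answer does not name any of them). A lazy pattern stops at the first closing tag it sees, and a greedy one runs to the last, so neither tracks which closing tag belongs to which opening tag.

```python
import re

# Hypothetical inputs, made up for this illustration.
NESTED = "<div>outer <div>inner</div> tail</div>"
SIBLINGS = "<div>first</div><div>second</div>"

# Lazy: stop at the first </div> after the opening tag.
lazy = re.compile(r"<div>(.*?)</div>", re.DOTALL)
# Greedy: run to the last </div> in the input.
greedy = re.compile(r"<div>(.*)</div>", re.DOTALL)

# The lazy match ends at the inner element's closing tag,
# so the outer element's content is cut short.
lazy_nested = lazy.search(NESTED).group(1)

# The greedy match swallows both sibling elements at once.
greedy_siblings = greedy.search(SIBLINGS).group(1)

print(repr(lazy_nested))      # 'outer <div>inner'
print(repr(greedy_siblings))  # 'first</div><div>second'
```

Neither quantifier can be right in general, because picking the matching close tag requires counting nesting depth, which plain regular expressions do not do.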

For example, this is a tricky one to parse...

<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
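As a rough comparison, the sketch below runs a naive regex and a real parser over the snippet above. Python and its standard-library xml.etree.ElementTree module are assumptions picked for the illustration, as are the helper names extract_with_regex and extract_with_parser; the answer only refers to specialized parsers in general.

```python
import re
import xml.etree.ElementTree as ET

# The tricky snippet from the answer, unchanged.
DOC = """<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>"""


def extract_with_regex(doc):
    """Naive attempt: grab everything up to the first </div>."""
    m = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.DOTALL)
    return m.group(1).strip() if m else None


def extract_with_parser(doc):
    """Let an XML parser deal with nesting and comments."""
    root = ET.fromstring(doc)
    target = root.find(".//div[@id='parse-this']")
    if target is None:
        return None
    # The default parser drops comments; itertext() collects the text.
    return "".join(target.itertext()).strip()


print(repr(extract_with_regex(DOC)))   # '<!-- oops'
print(repr(extract_with_parser(DOC)))  # 'try to get this value with regex'
```

The regex is fooled by the closing tag inside the comment, while the parser treats the comment as a comment and returns the intended text. ElementTree only accepts well-formed XML, so for real-world HTML a dedicated HTML parser would be the closer fit.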
