Can extended regex implementations parse HTML?

Question

I know what you're thinking - "oh my god, seriously, not again" - but please bear with me, my question is more than the title. Before we begin, I promise I will never try to parse arbitrary HTML with a regex, or ask anyone else how.

All of the many, many answers here explaining why you cannot do this rely on the formal definition of regular expressions. They parse regular languages, HTML is context-free but not regular, so you can't do it. But I have also heard that many regex implementations in various languages are not strictly regular; they come with extra tricks that break outside the bounds of formal regular expressions.

Since I don't know the details of any particular implementations, such as perl, my questions are:

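  1. Which features of regex tools are non-regular? Is it backreferences? And in which languages are they found?
  2. Are these extra tricks sufficient to parse all context-free languages?
  3. If the answer to #2 is "no", is there a formal class or classes of languages that these extra features cover exactly? How can we quickly tell whether the problem we are trying to solve is within the power of our not-necessarily-regular expressions?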

Recommended answer

The answer to your question is that yes, so-called "extended regexes" — which are perhaps more properly called patterns than regular expressions in the formal sense — such as those found in Perl and PCRE are indeed capable of recursive descent parsing of context-free grammars.
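A minimal sketch of what that recursion looks like, assuming Perl 5.10 or later, where `(?1)` re-enters capture group 1. The pattern recognizes the language a^n b^n, a textbook example of a language that is context-free but not regular. The helper name `is_anbn` is made up for this illustration.

```perl
use strict;
use warnings;

# a^n b^n for n >= 1: one 'a', optionally the whole group again, then one 'b'.
# (?1) recurses into capture group 1, which no formally regular expression can do.
my $anbn = qr/\A(a(?1)?b)\z/;

# Hypothetical helper: true when the string has the form a^n b^n.
sub is_anbn {
    my ($s) = @_;
    return $s =~ $anbn ? 1 : 0;
}

for my $s (qw(ab aabb aaabbb aab abb ba)) {
    printf "%-7s %s\n", $s, is_anbn($s) ? "matches" : "does not match";
}
```

Counting matched pairs like this is the same mechanism a grammatical pattern would use to pair opening and closing tags.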

This posting’s pair of approaches illustrate not so much theoretical as rather practical limits to applying regexes to X/HTML. The first approach given there, the one labelled naïve, is more like the sort you are apt to find in most programs that make such an attempt. This can be made to work on well-defined, non-generic X/HTML, often with very little effort. That is its best application, just as open-ended X/HTML is its worst.

The second approach, labelled wizardly, uses an actual grammar for parsing. As such, it is fully as powerful as any other grammatical approach. However, it is also far beyond the powers of the overwhelming majority of casual programmers. It also risks re-creating a perfectly fine wheel for negative benefit. I wrote it to show what can be done, but which under virtually no circumstances whatsoever ever should be done. I wanted to show people why they want to use a parser on open-ended X/HTML by showing them how devilishly hard it is to come even close to getting right even using some of the most powerful of pattern-matching facilities currently available.

Many have misread my posting as somehow advocating the opposite of what I am actually saying. Please make no mistake: I’m saying that it is far too complicated to use. It is a proof by counter-example. I had hoped that by showing how to do it with regexes, people would realize why they did not want to go down that road. While all things are possible, not all are expedient.

My personal rule of thumb is that if the required regex is of only the first category, I may well use it, but that if it requires the fully grammatical treatment of the second category, I use someone else’s already-written parser. So even though I can write a parser, I see no reason to do so, and plenty not to.

When carefully crafted for that explicit purpose, patterns can be more resilient to malformed X/HTML than off-the-shelf parsers tend to be, particularly if you have no real opportunity to hack on said parsers to make them more resilient to the common failure cases that web browsers tend to tolerate but validators do not. However, the grammatical patterns I provide above were designed for only well-formed but reasonably generic HTML (albeit without entity replacement, which is easily enough added). Error recovery in parsers is a separate issue altogether, and by no means a pleasant one.

Patterns, especially the far more commonplace non-grammatical ones most people are used to seeing and using, are much better suited for grabbing up discrete chunks one at a time than they are for producing a full syntactic analysis. In other words, regexes usually work better for lexing than they do for parsing. Without grammatical regexes, you should not try parsing grammars.
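To make the lexing point concrete, here is a rough sketch of grabbing chunks one at a time with `\G` and the `/gc` modifiers. It is an illustration under loud assumptions, not a usable HTML tokenizer: it ignores comments, CDATA, script bodies, and any `>` inside an attribute value, and the name `lex_markup` and the `TAG`/`TEXT` labels are invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: split a markup fragment into coarse TAG and TEXT tokens.
sub lex_markup {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    while (pos($input) < length $input) {
        # \G anchors each attempt where the previous match ended;
        # /c keeps that position when an alternative fails.
        if    ($input =~ /\G(<[^<>]*>)/gc) { push @tokens, [TAG  => $1] }
        elsif ($input =~ /\G([^<]+)/gc)    { push @tokens, [TEXT => $1] }
        elsif ($input =~ /\G(<)/gc)        { push @tokens, [TEXT => $1] }  # stray '<'
    }
    return @tokens;
}

for my $tok (lex_markup('<p>Hello <b>world</b></p>')) {
    printf "%-4s [%s]\n", @$tok;
}
```

Each branch consumes at least one character, so the loop always advances. Nothing here knows whether the tags nest correctly; that is the parsing job a flat token stream leaves undone.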

But don’t take that too far. I certainly do not mean to imply that you should immediately turn to a full-blown parser just because you want to tackle something that is recursively defined. The easiest and perhaps most commonly seen example of this sort of thing is a pattern to detect nested items, like parentheses. It’s extremely common for me to just plop down something simple like this in my code, and be done with it:

# delete all nested parens
s/\((?:[^()]*+|(?0))*\)//g;
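For readers who find the one-liner dense, the same substitution can be spelled out with the `/x` modifier and wrapped in a small function. This is only a restatement for illustration; the name `strip_parens` and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above.
sub strip_parens {
    my ($text) = @_;
    $text =~ s{
        \(              # an opening paren
        (?:
            [^()]*+     # a run of non-paren characters, possessive
          |
            (?0)        # or the whole pattern again: a nested group
        )*
        \)              # the closing paren that balances it
    }{}xg;
    return $text;
}

print strip_parens("keep (drop (nested) this) and (this) too"), "\n";
print strip_parens("an unbalanced (paren stays put"), "\n";
```

Because `(?0)` recurses into the entire pattern, each balanced group is removed together with everything nested inside it, while an opening paren that is never closed does not match and is left alone.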
