为什么无法使用正则表达式解析HTML/XML:用外行的术语进行的正式解释 [英] Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

查看:142
本文介绍了为什么无法使用正则表达式解析HTML/XML:用外行的术语进行的正式解释的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

SO上没有任何一天可以通过解析正则表达式来解析(X)HTML或XML的问题了.

尽管想出可以证明正则表达式不可行的示例或带有表达式集合以表示概念,所以我仍然找不到正式的解释,为什么不能用外行的话来做到这一点.

到目前为止,我在本网站上唯一能找到的正式解释可能是非常准确的,但对于自学成才的程序员来说还是很神秘的:

这里的缺陷是HTML是Chomsky Type 2语法(上下文无关) 语法)和RegEx是Chomsky Type 3语法(正则表达式)

或:

正则表达式只能匹配正则语言,但是HTML是 上下文无关的语言.

或:

有限自动机(这是常规的基础数据结构 表达式)除了其所处的状态外没有其他内存,如果 您有任意深度的嵌套,您需要任意大的嵌套 自动机,它与有限自动机的概念相冲突.

或:

常规语言的抽水引理是您不能做的原因 那个.

[为了公平起见:以上解释的大部分都链接到Wikipedia页面,但是比答案本身更容易理解].

所以我的问题是:请问有人可以用外行的方式对上述为什么不能使用正则表达式解析(X)HTML/XML的正式解释进行翻译吗? >

编辑:阅读完第一个答案后,我认为我应该澄清一下:我正在寻找一种翻译",该翻译也简要地解释了它尝试翻译的概念:在回答的最后,读者应该对常规语言"和无上下文语法"的含义有一个大概的了解.

解决方案

专注于这一点:

有限自动机(这是常规的基础数据结构 表达式)除了其所处的状态外没有其他内存,如果 您有任意深度的嵌套,您需要任意大的嵌套 自动机,它与有限自动机的概念相冲突.

正则表达式的定义等同于以下事实:可以通过有限的自动机(每个模式一个不同的自动机)来测试字符串是否与模式匹配.有限的自动机没有内存-没有堆栈,没有堆,没有可以擦写的无限磁带.它所具有的只是有限数量的内部状态,每个内部状态都可以从被测试的字符串中读取一个输入单元,并使用该状态来决定移至下一个状态.作为特殊情况,它具有两个终止状态:是,匹配"和否,不匹配".

另一方面,HTML具有可以任意深度嵌套的结构.要确定文件是否为有效的HTML,您需要检查所有的结束标记是否与先前的开始标记匹配.要了解它,您需要知道哪个元素正在关闭.没有任何办法记住"您所看到的打开标签,没有机会.

但是请注意,大多数"regex"库实际上不仅仅允许对正则表达式进行严格定义.如果它们可以匹配反向引用,那么它们已经超越了常规语言.因此,您不应该在HTML上使用正则表达式库的原因比HTML不规则的简单事实要复杂一些.

There is no day on SO that passes without a question about parsing (X)HTML or XML with regular expressions being asked.

While it's relatively easy to come up with examples that demonstrates the non-viability of regexes for this task or with a collection of expressions to represent the concept, I could still not find on SO a formal explanation of why this is not possible done in layman's terms.

The only formal explanations I could find so far on this site are probably extremely accurate, but also quite cryptic to the self-taught programmer:

the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression)

or:

Regular expressions can only match regular languages but HTML is a context-free language.

or:

A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

or:

The Pumping lemma for regular languages is the reason why you can't do that.

[To be fair: the majority of the above explanation link to wikipedia pages, but these are not much easier to understand than the answers themselves].

So my question is: could somebody please provide a translation in layman's terms of the formal explanations given above of why it is not possible to use regex for parsing (X)HTML/XML?

EDIT: After reading the first answer I thought that I should clarify: I am looking for a "translation" that also briefely explains the concepts it tries to translate: at the end of an answer, the reader should have a rough idea - for example - of what "regular language" and "context-free grammar" mean...

解决方案

Concentrate on this one:

A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

The definition of regular expressions is equivalent to the fact that a test of whether a string matches the pattern can be performed by a finite automaton (one different automaton for each pattern). A finite automaton has no memory - no stack, no heap, no infinite tape to scribble on. All it has is a finite number of internal states, each of which can read a unit of input from the string being tested, and use that to decide which state to move to next. As special cases, it has two termination states: "yes, that matched", and "no, that didn't match".

HTML, on the other hand, has structures that can nest arbitrarily deep. To determine whether a file is valid HTML or not, you need to check that all the closing tags match a previous opening tag. To understand it, you need to know which element is being closed. Without any means to "remember" what opening tags you've seen, no chance.

Note however that most "regex" libraries actually permit more than just the strict definition of regular expressions. If they can match back-references, then they've gone beyond a regular language. So the reason why you shouldn't use a regex library on HTML is a little more complex than the simple fact that HTML is not regular.

这篇关于为什么无法使用正则表达式解析HTML/XML:用外行的术语进行的正式解释的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆