为什么不能使用正则表达式来解析 HTML/XML:外行的正式解释 [英] Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

查看:22
本文介绍了为什么不能使用正则表达式来解析 HTML/XML:外行的正式解释的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

关于使用正则表达式解析 (X)HTML 或 XML 的问题,SO 上的每一天都没有过.

虽然想出证明正则表达式对于此任务不可行的示例相对容易表达式集合来表示概念,我仍然无法在 SO 上找到正式的解释,说明为什么用外行的术语无法做到这一点.

到目前为止,我在本网站上能找到的唯一正式解释可能非常准确,但对于自学成才的程序员来说也很神秘:

<块引用>

这里的缺陷是 HTML 是 Chomsky Type 2 语法(上下文无关)语法)和 RegEx 是 Chomsky Type 3 语法(正则表达式)

或:

<块引用>

正则表达式只能匹配正则语言,而 HTML 是一个上下文无关语言.

或:

<块引用>

一个有限自动机(它是一个正则的基础数据结构表达式)除了它所处的状态之外没有内存,如果你有任意深的嵌套,你需要一个任意大的自动机,它与有限自动机的概念相冲突.

或:

<块引用>

常规语言的 Pumping 引理是你不能做的原因

[公平地说:上面的大部分解释链接到维基百科页面,但这些并不比答案本身更容易理解].

所以我的问题是:有人可以用外行的术语翻译上面给出的为什么不能使用正则表达式来解析 (X)HTML/XML 的正式解释吗?>

在阅读第一个答案后,我认为我应该澄清:我正在寻找一个翻译",它也简要地解释它试图翻译的概念:在回答的最后,读者应该有一个粗略的想法——例如——常规语言"和上下文无关语法"是什么意思......

解决方案

专注于此:

<块引用>

一个有限自动机(它是一个正则的基础数据结构表达式)除了它所处的状态之外没有内存,如果你有任意深的嵌套,你需要一个任意大的自动机,它与有限自动机的概念相冲突.

正则表达式的定义等价于这样一个事实,即可以通过有限自动机(每个模式有一个不同的自动机)来测试字符串是否与模式匹配.有限自动机没有内存——没有堆栈,没有堆,没有无限的磁带可供涂鸦.它所拥有的只是有限数量的内部状态,每个状态都可以从被测试的字符串中读取一个输入单元,并使用它来决定移动到下一个状态.作为特殊情况,它有两个终止状态:是,匹配"和否,不匹配".

另一方面,HTML 具有可以任意深度嵌套的结构.要确定文件是否是有效的 HTML,您需要检查所有结束标记是否与之前的开始标记匹配.要理解它,您需要知道哪个元素正在关闭.没有任何办法记住"你看过什么开头标签,没有机会.

但是请注意,大多数regex"库实际上不仅仅允许正则表达式的严格定义.如果它们可以匹配反向引用,那么它们就超越了常规语言.因此,您不应该在 HTML 上使用正则表达式库的原因比 HTML 不规则这一简单事实要复杂一些.

There is no day on SO that passes without a question about parsing (X)HTML or XML with regular expressions being asked.

While it's relatively easy to come up with examples that demonstrates the non-viability of regexes for this task or with a collection of expressions to represent the concept, I could still not find on SO a formal explanation of why this is not possible done in layman's terms.

The only formal explanations I could find so far on this site are probably extremely accurate, but also quite cryptic to the self-taught programmer:

the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression)

or:

Regular expressions can only match regular languages but HTML is a context-free language.

or:

A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

or:

The Pumping lemma for regular languages is the reason why you can't do that.

[To be fair: the majority of the above explanation link to wikipedia pages, but these are not much easier to understand than the answers themselves].

So my question is: could somebody please provide a translation in layman's terms of the formal explanations given above of why it is not possible to use regex for parsing (X)HTML/XML?

EDIT: After reading the first answer I thought that I should clarify: I am looking for a "translation" that also briefely explains the concepts it tries to translate: at the end of an answer, the reader should have a rough idea - for example - of what "regular language" and "context-free grammar" mean...

解决方案

Concentrate on this one:

A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

The definition of regular expressions is equivalent to the fact that a test of whether a string matches the pattern can be performed by a finite automaton (one different automaton for each pattern). A finite automaton has no memory - no stack, no heap, no infinite tape to scribble on. All it has is a finite number of internal states, each of which can read a unit of input from the string being tested, and use that to decide which state to move to next. As special cases, it has two termination states: "yes, that matched", and "no, that didn't match".

HTML, on the other hand, has structures that can nest arbitrarily deep. To determine whether a file is valid HTML or not, you need to check that all the closing tags match a previous opening tag. To understand it, you need to know which element is being closed. Without any means to "remember" what opening tags you've seen, no chance.

Note however that most "regex" libraries actually permit more than just the strict definition of regular expressions. If they can match back-references, then they've gone beyond a regular language. So the reason why you shouldn't use a regex library on HTML is a little more complex than the simple fact that HTML is not regular.

这篇关于为什么不能使用正则表达式来解析 HTML/XML:外行的正式解释的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆