Why use DOM to parse webpages instead of regex?


Question

I've been searching for questions about finding content in a page, and a lot of answers recommend using DOM when parsing webpages instead of regex. Why is that? Does it improve the processing time or something?

Solution

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
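
To make the nesting problem concrete, here is a minimal sketch using only Python's standard library. The sample markup and the `DivText` class are made up for illustration and are not from the original answer. A non-greedy regex stops at the first closing tag it sees, while a parser that tracks open elements pairs each closing tag with the right opening tag.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <div> nested inside another.
HTML = "<div>outer <div>inner</div> tail</div>"

# Non-greedy regex: it stops at the first </div>, so the outer
# element is cut short and the inner one is never reported on its own.
regex_matches = re.findall(r"<div>(.*?)</div>", HTML)


class DivText(HTMLParser):
    """Collects the text of each <div>, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one text buffer per currently open <div>
        self.results = []  # text of each completed <div>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([])

    def handle_data(self, data):
        # Text belongs to every <div> that is currently open.
        for buf in self.stack:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.results.append("".join(self.stack.pop()))


parser = DivText()
parser.feed(HTML)
parser.close()
parsed = parser.results

print(regex_matches)  # ['outer <div>inner']
print(parsed)         # ['inner', 'outer inner tail']
```

Switching to a greedy `.*` happens to fix this one input, but it then swallows everything between the first and last `</div>` when there are sibling elements, so it only trades one failure for another.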

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
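
As a rough illustration of how a use-case-specific regex breaks, the sketch below extracts link targets from a made-up snippet (the markup and the `LinkCollector` class are assumptions for this example). The regex was written with only one form of the tag in mind. It misses valid variants and also picks up a link that is commented out, while the standard library's `html.parser` handles quoting, case, attribute order, and comments.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the same kind of link written four different ways.
HTML = """
<a href="/a">one</a>
<a class="nav" href='/b'>two</a>
<A HREF="/c">three</A>
<!-- <a href="/old">removed</a> -->
"""

# Written for the first form only: lowercase, href first, double quotes.
narrow = re.findall(r'<a href="(.*?)">', HTML)


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, values arrive unquoted.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)
collector.close()
links = collector.links

print(narrow)  # ['/a', '/old']
print(links)   # ['/a', '/b', '/c']
```

Each miss could be patched in the regex, but every patch encodes another guess about what the HTML will look like, which is the difficulty described above.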

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
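
Here is a small sketch of that difference, under some loud assumptions: the two snippets are invented, they are well-formed XML, and `xml.etree.ElementTree` only supports a limited subset of XPath. For real-world HTML you would likely want an HTML-aware library with full XPath support instead. The page changes from `OLD` to `NEW` (an extra attribute and a wrapping `<b>`), and only the regex needs rewriting.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment before and after a small redesign.
OLD = '<ul><li class="price">10</li><li class="name">Tea</li></ul>'
NEW = '<ul><li id="p1" class="price"><b>10</b></li><li class="name">Tea</li></ul>'

# The regex spells out the exact tag syntax it expects.
PRICE_RE = re.compile(r'<li class="price">(\d+)</li>')


def price_regex(doc):
    """Return the price found by the regex, or None if it no longer matches."""
    m = PRICE_RE.search(doc)
    return m.group(1) if m else None


def price_xpath(doc):
    """Return the text of the first <li class="price">, however it is wrapped."""
    node = ET.fromstring(doc).find(".//li[@class='price']")
    return "".join(node.itertext()) if node is not None else None


print(price_regex(OLD), price_regex(NEW))  # 10 None
print(price_xpath(OLD), price_xpath(NEW))  # 10 10
```

The path expression says what is wanted (an `li` whose class is `price`) and leaves the tag syntax to the parser, which is why it survives the change.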

So, instead of using the wrong tool for the job (a text parsing tool for a structured document), use the right tool for the job (an HTML parser for parsing HTML).
