Using regular expressions to parse HTML: why not?


Question

It seems that every question on Stack Overflow where the asker uses regex to grab some information from HTML inevitably gets an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are quote-unquote "real" HTML parsers out there, like Beautiful Soup, and I'm sure they're powerful and useful. But if you're just doing something simple, quick, or dirty, why bother with something so complicated when a few regex statements will work just fine?

Moreover, is there something fundamental about regex that I don't understand that makes it a bad choice for parsing in general?

Recommended answer

Parsing HTML in full is not possible with regular expressions, because it depends on matching each opening tag with its closing tag, and elements can nest to arbitrary depth, which regexps cannot track.
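To make the tag-matching problem concrete, here is a minimal sketch in Python. It assumes the standard-library `html.parser` module rather than Beautiful Soup, and the names `SAMPLE`, `FirstDivText`, and `first_div_text` are made up for illustration. Neither the lazy nor the greedy pattern recovers the contents of the first `<div>` once another `<div>` is nested inside it, while a parser that counts nesting depth does.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one div nested in another, followed by a sibling div.
SAMPLE = "<div>outer <div>inner</div> tail</div><div>next</div>"

# Lazy match stops at the first </div>, which belongs to the inner element.
lazy = re.search(r"<div>(.*?)</div>", SAMPLE).group(1)
# lazy == "outer <div>inner"

# Greedy match runs to the last </div>, swallowing the sibling element.
greedy = re.search(r"<div>(.*)</div>", SAMPLE).group(1)
# greedy == "outer <div>inner</div> tail</div><div>next"


class FirstDivText(HTMLParser):
    """Collect the text of the first top-level <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.done = False   # set once the first top-level <div> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def first_div_text(html):
    parser = FirstDivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(first_div_text(SAMPLE))  # outer inner tail
```

The depth counter is the piece a plain regular expression has no way to express for unbounded nesting; the rest of the class is just bookkeeping.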

Regular expressions can only match regular languages, but HTML is a context-free language that is not regular. (As @StefanPochmann pointed out, regular languages are also context-free, so being context-free does not by itself mean being non-regular.) The only thing you can do with regexps on HTML is apply heuristics, and those will not work in every case. For any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
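As a rough illustration of a heuristic breaking down, the sketch below uses a link-extracting pattern that looks reasonable and a small input built to defeat it. The input `PAGE`, the pattern, and the `LinkCollector` class are hypothetical examples, and the parser is Python's standard-library `html.parser`, chosen only to keep the example self-contained.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one link is commented out, and the live link uses
# single quotes with another attribute placed before href.
PAGE = """<!-- <a href="old.html">old</a> -->
<a class="nav" href='new.html'>new</a>"""

# Heuristic: assume every link looks like <a href="...">.
regex_links = re.findall(r'<a\s+href="([^"]*)"', PAGE)
# regex_links == ["old.html"]  (finds the commented-out link, misses the live one)


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags the parser actually recognizes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()
parsed_links = collector.links
# parsed_links == ["new.html"]
```

The pattern could be patched to handle comments, single quotes, and attribute order, but each patch only covers the cases someone thought of, which is the sense in which the answer calls the approach a heuristic.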
