Using regular expressions to parse HTML: why not?


Question

It seems like every question on Stack Overflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are quote-unquote "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there something fundamental that I don't understand about regex that makes it a bad choice for parsing in general?

Recommended answer

Complete HTML parsing is not possible with regular expressions, because it depends on matching each opening tag with its closing tag. Elements can nest to arbitrary depth, and a regular expression has no way to keep track of that depth.
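
As a rough illustration of the tag-matching problem, here is a minimal sketch using Python's standard `re` module. The sample strings and variable names are made up for this example; the point is only that neither the lazy nor the greedy pattern pairs each opening `<div>` with its own closing tag.

```python
import re

# Hypothetical sample input: one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# Lazy match: stops at the FIRST closing tag, so the outer element is cut short.
lazy = re.search(r"<div>(.*?)</div>", html).group(1)

# Greedy match: runs to the LAST closing tag, which happens to work here...
greedy = re.search(r"<div>(.*)</div>", html).group(1)

# ...but it swallows both elements when two siblings sit side by side.
siblings = "<div>a</div><div>b</div>"
greedy_siblings = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))             # 'outer <div>inner'
print(repr(greedy))           # 'outer <div>inner</div> tail'
print(repr(greedy_siblings))  # 'a</div><div>b'
```

Each pattern is right for one input shape and wrong for the other, because the decision of which `</div>` closes which `<div>` requires counting nesting levels.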

Regular expressions can only match regular languages, but HTML is a context-free language and not a regular one. (As @StefanPochmann pointed out, regular languages are also context-free, so context-free doesn't necessarily mean not regular.) The only thing you can do with regexps on HTML is apply heuristics, and those will not work in every case. For any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
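
To make the "heuristics" point concrete, here is a small sketch that compares a plausible-looking link-extraction regex with Python's standard `html.parser.HTMLParser`. The sample document, the pattern, and the `LinkCollector` class are all invented for this illustration; it is one way a heuristic can go wrong, not a catalogue of every failure.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: a commented-out link, an attribute value containing '>',
# and an href written with single quotes.
doc = '''<!-- <a href="old.html">old</a> -->
<a title="a > b" href="real.html">real</a>
<a href='single.html'>single</a>'''

# Heuristic: grab href="..." from anything that looks like an <a> start tag.
regex_links = re.findall(r'<a[^>]*href="([^"]*)"', doc)


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(doc)
parser_links = collector.links

print(regex_links)   # ['old.html']
print(parser_links)  # ['real.html', 'single.html']
```

The regex reports the one link that is commented out and misses both live ones, while the parser skips the comment and handles the quoting. The pattern could be patched for these three cases, but each patch is another heuristic that some other document will defeat.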
