BeautifulSoup vs. real-world HTML comments

Views: 102  Posted: 2019/6/6 11:42:13  Tag: python

Problem description

The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:

<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>

Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser
level problem.
John Nagle
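
For illustration only (this is not from the thread): one way to work around the problem described above is to rewrite the malformed "<!text>" declarations into well-formed comments before handing the page to a parser. The sketch below assumes that treating everything from "<!" up to the first ">" as a comment is close enough to what browsers do; the function name normalize_bogus_comments is hypothetical.

```python
import re

# "<!" that does not start a real comment, a DOCTYPE, or a CDATA section,
# followed by everything up to the first ">".
_BOGUS_COMMENT = re.compile(r"<!(?!--|doctype|\[CDATA\[)([^>]*)>", re.IGNORECASE)


def _to_comment(match):
    # Break up any "--" runs, which are not allowed inside a comment body.
    body = re.sub(r"-(?=-)", "- ", match.group(1))
    if body.endswith("-"):
        body += " "  # avoid ending the comment with "--->"
    return "<!--" + body + "-->"


def normalize_bogus_comments(html):
    """Rewrite <!text> declarations as well-formed <!--text--> comments."""
    return _BOGUS_COMMENT.sub(_to_comment, html)


page = (
    "<!Hello there! Welcome to The Environment Directory!>\n"
    "<!See ya, - JD >\n"
    "<HTML><HEAD>\n"
    "<TITLE>Environment Web Directory</TITLE>\n"
)
cleaned = normalize_bogus_comments(page)
```

The cleaned string can then be passed to BeautifulSoup in place of the raw page. Real comments and the DOCTYPE are left untouched.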

Recommended answer

On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:

> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
>
> <HTML><HEAD>
> <TITLE>Environment Web Directory</TITLE>
>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
>
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk. It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.

Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
Carl Banks
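
A minimal sketch of the "run it as a filter" idea, assuming the tidy command-line program is installed and on the PATH. The helper name run_html_filter and the particular flag set are assumptions, not something given in the thread; check the flags against the tidy version in use.

```python
import subprocess

# Assumed command line for HTML Tidy: quiet, UTF-8 in and out, no warnings,
# and still produce output when the input has errors.
TIDY_COMMAND = ("tidy", "-q", "-utf8", "--show-warnings", "no", "--force-output", "yes")


def run_html_filter(html, command=TIDY_COMMAND):
    """Pipe html through an external filter program and return its stdout."""
    result = subprocess.run(
        list(command),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # tidy exits non-zero on warnings or errors, which is expected here
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would look like cleaned = run_html_filter(raw_html), with the result then passed to BeautifulSoup. The call raises FileNotFoundError if the program named in the command is not installed.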

Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:

>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.

That's a good suggestion. In fact it looks like there's a Python API
for tidy:
http://utidylib.berlios.de/

Tried it, seems to get rid of <! comments just fine.
