BeautifulSoup vs. real-world HTML comments


Question


The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:

<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>

Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser
level problem.

John Nagle

Recommended answer

On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:

> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
>
> <HTML><HEAD>
> <TITLE>Environment Web Directory</TITLE>
>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
>
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk. It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.




Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
Carl Banks
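
The "run it as a filter" suggestion above could be sketched roughly as follows. This is only an illustration, assuming the `tidy` command-line program is installed and on the PATH; the helper name `run_filter` and the particular set of options are choices made for this example, not something specified in the thread.

```python
import subprocess

# Assumed tidy invocation: quiet mode, UTF-8 in and out, no warning chatter,
# and output produced even when tidy reports errors in the input.
TIDY_CMD = ["tidy", "-q", "-utf8", "--show-warnings", "no", "--force-output", "yes"]


def run_filter(html, cmd=TIDY_CMD):
    """Pipe an HTML string through an external filter and return its stdout.

    `run_filter` is a hypothetical helper. `cmd` is any program that reads
    a document on stdin and writes the cleaned document to stdout.
    """
    result = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 when it emitted warnings and 2 when it found errors,
    # so a nonzero status is not treated as a failure here.
    return result.stdout.decode("utf-8", errors="replace")
```

The cleaned string returned by `run_filter(downloaded_html)` would then be handed to BeautifulSoup or HTMLParser instead of the raw download. If `tidy` is not installed, `subprocess.run` raises `FileNotFoundError`.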


Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.



> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.




Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
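
Since the problem sits at the HTMLParser level, one possible workaround is to rewrite the `<!...>` pseudo-comments into well-formed `<!-- ... -->` comments before the parser ever sees them. The sketch below is an assumption-laden illustration, not part of BeautifulSoup or HTMLParser: `normalize_comments` and `TitleAndComments` are hypothetical names, and the regular expression would mangle a `<!...>` sequence that appears inside a real comment or that itself contains `--`. Whether current HTMLParser or BeautifulSoup releases still need this is not something the thread establishes.

```python
import re
from html.parser import HTMLParser

# Matches "<!...>" constructs that are not real comments ("<!--"),
# marked sections ("<![") or a doctype declaration.
BOGUS_COMMENT = re.compile(r"<!(?!--|\[|doctype)([^>]*)>", re.IGNORECASE)


def normalize_comments(html):
    """Rewrite browser-tolerated "<!text>" into a proper "<!--text-->"."""
    return BOGUS_COMMENT.sub(r"<!--\1-->", html)


class TitleAndComments(HTMLParser):
    """Collects comment bodies and the page title, to show the parse works."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.title = ""
        self._in_title = False

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# The page fragment quoted in the original post.
PAGE = """<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>
"""

parser = TitleAndComments()
parser.feed(normalize_comments(PAGE))
parser.close()
print(parser.title)
print(parser.comments)
```

Real comments and the doctype pass through `normalize_comments` untouched, so the function can be applied to every downloaded page, not only the broken ones.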




Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:

>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.





> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.




That's a good suggestion. In fact, it looks like there's a Python API
for tidy:
http://utidylib.berlios.de/
Tried it; it seems to get rid of the <! comments just fine.

