BeautifulSoup vs. real-world HTML comments
Question
The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:
<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >
<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>
Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.
BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk. It's actually
HTMLParser that parses comments, so this is really an
HTMLParser-level problem.
John Nagle
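One possible workaround for pages like this, offered only as a sketch and not as anything BeautifulSoup or HTMLParser does itself, is to strip the `<!...>` pseudo-comments before the parser sees them. The pattern and the function name `strip_bogus_declarations` are made up for illustration, and the sketch assumes such a pseudo-comment never contains a `>` in its text.

```python
import re

# Hypothetical pattern: "<!" up to the next ">", skipping real comments
# ("<!--"), marked sections ("<![") and DOCTYPE declarations.
BOGUS_DECL = re.compile(r"<!(?!--|\[|doctype\b)[^>]*>", re.IGNORECASE)


def strip_bogus_declarations(html):
    """Remove browser-tolerated pseudo-comments such as '<!Hello there!>'."""
    return BOGUS_DECL.sub("", html)


sample = (
    "<!Hello there! Welcome to The Environment Directory!>\n"
    "<!See ya, - JD >\n"
    "<!DOCTYPE html>\n"
    "<!-- a real comment -->\n"
    "<HTML><HEAD>\n"
    "<TITLE>Environment Web Directory</TITLE>\n"
)

cleaned = strip_bogus_declarations(sample)
print(cleaned)
```

The cleaned text can then be handed to whichever parser is in use. A regex like this is a blunt tool, so it is worth checking it against the actual pages being scraped.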
Recommended answer
On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
> <HTML><HEAD>
> <TITLE>Environment Web Directory</TITLE>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk. It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
Carl Banks
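Running tidy "as a filter" could look roughly like the sketch below, which pipes the downloaded HTML through the command-line program and reads the repaired markup back. It assumes a `tidy` executable is on the PATH. The flags shown (`-q`, `-utf8`, `--show-warnings no`, `--force-output yes`) are the ones I would expect to want, but check them against the installed version. The helper name `filter_html` is made up for illustration.

```python
import shutil
import subprocess
import sys

# Assumed command line for HTML Tidy; adjust to the installed version.
TIDY_CMD = ("tidy", "-q", "-utf8", "--show-warnings", "no",
            "--force-output", "yes")


def filter_html(html, cmd=TIDY_CMD):
    """Pipe html through an external filter program and return its stdout."""
    proc = subprocess.run(
        list(cmd),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy signals warnings and errors through a non-zero exit status,
    # so the status is deliberately not treated as fatal here.
    return proc.stdout.decode("utf-8", errors="replace")


if shutil.which("tidy"):
    print(filter_html("<!Hello there!>\n<HTML><HEAD><TITLE>t</TITLE>"))
else:
    print("tidy not found on PATH; nothing filtered", file=sys.stderr)
```

The output of `filter_html` would then be passed to BeautifulSoup in place of the raw page.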
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
> > BeautifulSoup can't parse this page usefully at all.
> > It treats the entire page as a text chunk. It's actually
> > HTMLParser that parses comments, so this is really an HTMLParser
> > level problem.
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
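To see where that hand-off happens, one can subclass the parser and log what it reports. This sketch uses Python 3's `html.parser` module, which is an assumption on my part: the thread refers to the older `HTMLParser` module, whose behaviour on malformed input differed. `CommentProbe` and `probe` are names invented for illustration.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Record what the parser reports as comments, declarations and tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        # Called by the parser for whatever it decides is a comment.
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        # Called for declarations such as a DOCTYPE.
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))


def probe(html):
    """Feed html to a fresh CommentProbe and return the recorded events."""
    parser = CommentProbe()
    parser.feed(html)
    parser.close()
    return parser.events


print(probe("<!-- a well-formed comment --><p>"))
print(probe("<!Hello there!>\n<HTML><HEAD>"))
```

How the second call classifies `<!Hello there!>` depends on the parser version, which is the point: a library built on top of it inherits that decision.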
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
> > The syntax that browsers understand as HTML comments is much less
> > restrictive than what BeautifulSoup understands. I keep running into
> > sites with formally incorrect HTML comments which are parsed happily
> > by browsers. Here's yet another example, this one from
> > "http://www.webdirectory.com". The page starts like this:
> > <!Hello there! Welcome to The Environment Directory!>
> > <!Not too much exciting HTML code here but it does the job! >
> > <!See ya, - JD >
> > <HTML><HEAD>
> > <TITLE>Environment Web Directory</TITLE>
> > Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> > without problems.
> > BeautifulSoup can't parse this page usefully at all.
> > It treats the entire page as a text chunk. It's actually
> > HTMLParser that parses comments, so this is really an HTMLParser
> > level problem.
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
That's a good suggestion. In fact it looks like there's a Python API
for tidy:
http://utidylib.berlios.de/
Tried it, seems to get rid of <! comments just fine.