BeautifulSoup (bs4) parsing wrong

Question

Parsing this sample document with bs4, under Python 2.7.6:

<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>

Using:

from bs4 import BeautifulSoup as BS
...
tree = BS(fh)

HTML has, for ages, allowed end-tags to be omitted for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:

<html>
 <body>
  <p>
   HTML allows omitting P end-tags.
   <p>
    Like that and this.
    <p>
     And this, too.
     <p>
      What happened?
     </p>
     <p>
      And can we
      <p>
       nest a paragraph, too?
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>

It's not prettify()'s fault, because traversing the tree manually gives the same structure:

<[document]>
    <html>
        ␊
        <body>
            ␊
            <p>
                HTML allows omitting P end-tags.␊␊
                <p>
                    Like that and this.␊␊
                    <p>
                        And this, too.␊␊
                        <p>
                            What happened?
                        </p>
                        ␊
                        <p>
                            And can we 
                            <p>
                                nest a paragraph, too?
                            </p>
                        </p>
                        ␊
                    </p>
                </p>
            </p>
        </body>
        ␊
    </html>
    ␊
</[document]>

Now, this would be the right result for XML (at least up to </body>, at which point it should report a well-formedness error). But this ain't XML. What gives?

Recommended answer

The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use a different parser. Apparently the default in my setup is html.parser (Python's built-in parser), which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
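One way to see where the nesting may come from: Python's built-in HTMLParser is an event-based tokenizer, so it reports only the tags that literally appear in the input and never synthesizes an omitted </p>. The sketch below counts the P start and end tags it reports for the sample document. The TagCounter class and the SAMPLE name are hypothetical helpers for illustration, not part of bs4, and this is only a rough check, not a description of how bs4 builds its tree internally.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question

# The sample document from the question.
SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>
"""


class TagCounter(HTMLParser):
    """Hypothetical helper: counts the <p> start and end tags the parser reports."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.p_starts = 0
        self.p_ends = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_starts += 1

    def handle_endtag(self, tag):
        # Called only for end tags that are actually present in the source.
        if tag == "p":
            self.p_ends += 1


counter = TagCounter()
counter.feed(SAMPLE)
counter.close()
print(counter.p_starts, counter.p_ends)  # prints: 6 3
```

Six paragraphs are opened but only three end tags are reported, so any tree builder sitting on top of these events has to apply HTML's omitted-end-tag rule itself; if it does not, paragraphs stay open until an enclosing end tag such as </body> arrives, which is consistent with the tree shown in the question.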

Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")
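To compare parsers on the sample document without reading prettify() output by eye, one option is to measure how deeply P elements nest inside one another. This is a minimal sketch that assumes bs4 is installed; max_p_depth and SAMPLE are hypothetical names introduced here, and any parser that is not installed is reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The sample document from the question.
SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>
"""


def max_p_depth(markup, parser):
    """Return the deepest <p>-inside-<p> nesting level, or None if the
    requested parser is not installed (hypothetical helper)."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    # Depth of a paragraph = itself plus every <p> among its ancestors.
    depths = [
        1 + sum(1 for ancestor in p.parents if ancestor.name == "p")
        for p in soup.find_all("p")
    ]
    return max(depths) if depths else 0


for parser in ("html.parser", "lxml", "html5lib"):
    print(parser, max_p_depth(SAMPLE, parser))
```

A depth of 1 means no paragraph is nested inside another, which is what the HTML rules call for. The tree shown in the question corresponds to a depth of 5.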
