How to prevent BeautifulSoup4 from adding extra <html><body> tags to the soup?

Question

In BeautifulSoup 3 and earlier I could easily take any chunk of HTML and get a string representation in this way:

from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
    '<div><b>soup 3</b></div>'

However with BeautifulSoup4 the same operation creates additional tags:

from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
    '<html><body><div><b>soup 4</b></div></body></html>'
     ^^^^^^^^^^^^                        ^^^^^^^^^^^^^^ 

I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers that are available with BS4.

Recommended answer

If you want your code to work on everyone's machine, no matter which parser(s) they have installed (the same lxml version built on libxml2 2.9 vs. 2.8 behaves very differently, the stdlib html.parser changed radically between Python 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.

If you know you have a fragment, something like this will give you exactly that fragment:

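from bs4 import BeautifulSoup

def parse_fragment(html):
    # Return the first node of the fragment, skipping any wrapper
    # tags (<html>, <body>) that the installed parser may have added.
    soup4 = BeautifulSoup(html)
    if soup4.body:
        return soup4.body.next_element
    elif soup4.html:
        return soup4.html.next_element
    else:
        return soup4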

Of course, if you know your fragment is a single div, it's even easier, although it's hard to think of a use case where you'd know that:

soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
div = soup4.div  # the <div> tag itself, without any wrapper tags
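
When the goal is the string form of the whole fragment, as in the question, one option is to serialize only the contents of whichever wrapper the parser added. The following is a minimal sketch, not an official BS4 setting: the helper name fragment_html is hypothetical, and it assumes the fragment does not itself contain <html> or <body> tags. Parsers may still rearrange unusual fragments, so the result is not guaranteed to match the input exactly.

```python
from bs4 import BeautifulSoup


def fragment_html(markup, parser="html.parser"):
    """Parse markup and serialize it without parser-added wrapper tags.

    fragment_html is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    # Use the innermost wrapper that exists; if the parser added none,
    # fall back to the soup object itself.
    root = soup.body or soup.html or soup
    # decode_contents() renders the children of a tag, not the tag itself.
    return root.decode_contents()


print(fragment_html('<div><b>soup 4</b></div>'))
```

Passing "html.parser" explicitly also sidesteps the problem for this input, since that parser does not add <html><body> wrappers to a fragment.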


If you want to know why this happens:

BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.

As the "Differences between parsers" section of the documentation says:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

So, while this exact difference isn't documented, it's just a special case of something that is.
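
To see which results a given machine produces, the same fragment can be run through every parser that happens to be installed. This is a rough sketch: the function name parse_with_each and the list of parser names tried are assumptions for illustration, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

FRAGMENT = '<div><b>soup 4</b></div>'


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized soup} for each installed parser.

    parse_with_each is a hypothetical helper name, not part of bs4.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser
            # library is not installed; skip that parser.
            continue
    return results


for name, output in parse_with_each(FRAGMENT).items():
    print(f"{name:12} {output}")
```

With typical versions, html.parser returns the fragment unchanged, lxml wraps it in <html><body>, and html5lib additionally inserts an empty <head>, but the exact output depends on the parser versions installed.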
