beautifulsoup configure autoclosing tags
Problem description
Let me explain my issue with an example:
from bs4 import BeautifulSoup
txt = """
<html>
<body>
<ul>
<li> 1
<li> 2
</ul>
</body>
"""
soup = BeautifulSoup(txt)
print(soup.prettify())
Here is the output of this script:
<html>
<body>
<ul>
<li>
1
<li>
2
</li>
</li>
</ul>
</body>
</html>
As you can see, the li tags in the input HTML were not closed. BeautifulSoup repaired them in its own way, nesting the second li inside the first. Is it possible to configure BeautifulSoup to produce this output instead?
<html>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
The 'fixing' is applied by the parser used to load the HTML into the BeautifulSoup object tree.
You can swap in a different parser; different parsers repair broken HTML in different ways. The alternatives require installing additional packages; by default only the html.parser option is available.
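To see why a repair step is needed at all, here is a minimal sketch that uses only Python's standard html.parser module, which the html.parser option builds on. The names TagTrace and trace_tags are made up for this illustration and are not part of BeautifulSoup. The sketch records the raw tag events for the markup from the question: two li start tags arrive and no li end tag ever does, so whatever builds the tree has to decide for itself where each li ends.

```python
from html.parser import HTMLParser

TXT = """
<html>
<body>
<ul>
<li> 1
<li> 2
</ul>
</body>
"""


class TagTrace(HTMLParser):
    """Hypothetical helper: record tag events exactly as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def trace_tags(markup):
    """Feed the markup through the tracer and return the list of events."""
    tracer = TagTrace()
    tracer.feed(markup)
    tracer.close()
    return tracer.events


events = trace_tags(TXT)
for kind, tag in events:
    print(kind, tag)
```

The event stream contains no implicit closing of li, which is consistent with the nested result shown in the question: the open li elements are only closed once the enclosing ul ends.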
I'd use the html5lib parser here, since it interprets non-standard HTML the same way a browser would. Alternatively, you can try the lxml parser:
>>> print(BeautifulSoup(txt, 'html5lib').prettify())
<html>
<head>
</head>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
>>> print(BeautifulSoup(txt, 'lxml').prettify())
<html>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
As you can see, both of these produce the desired output.
Only the built-in html.parser option exhibits this problem:
>>> print(BeautifulSoup(txt, 'html.parser').prettify())
<html>
<body>
<ul>
<li>
1
<li>
2
</li>
</li>
</ul>
</body>
</html>
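The comparison above can also be checked programmatically rather than by reading the prettified output. The following is a rough sketch, assuming bs4 is installed; li_is_nested and compare_parsers are hypothetical helper names introduced here. A parser that is not installed makes BeautifulSoup raise FeatureNotFound, which the sketch records as None instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

TXT = """
<html>
<body>
<ul>
<li> 1
<li> 2
</ul>
</body>
"""


def li_is_nested(markup, parser):
    """Return True if any <li> ends up inside another <li> after parsing."""
    soup = BeautifulSoup(markup, parser)
    return any(li.find_parent("li") is not None for li in soup.find_all("li"))


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False (nested or not), or None if missing."""
    results = {}
    for name in parsers:
        try:
            results[name] = li_is_nested(markup, name)
        except FeatureNotFound:
            # The optional parser package is not installed in this environment.
            results[name] = None
    return results


results = compare_parsers(TXT)
for name, nested in results.items():
    if nested is None:
        print(name, "-> not installed")
    else:
        print(name, "-> nested li" if nested else "-> sibling li")
```

Naming the parser explicitly, as in BeautifulSoup(txt, 'html5lib'), also keeps the result from depending on which parser packages happen to be installed on a given machine.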