BeautifulSoup different parsers

Views: 54 | Published: 2021/4/15 19:02:31 | Tags: python-3.x, beautifulsoup


Question

Could anyone elaborate on the difference between parsers such as html.parser and html5lib? I've stumbled across a weird behavior: when using html.parser, it ignores all the tags in a specific place. Look at this code:

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)

This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected. Does anyone know the reason for that?
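
A possible explanation, offered as an assumption rather than a confirmed diagnosis: html.parser may read the stray `<![endif]-->` as a marked-section declaration and scan forward to the next `]>`, which here sits in the later `<!--[if lte IE 8]>` line, so the four anchors in between would be swallowed. The sketch below uses only the standard-library HTMLParser, which the 'html.parser' option in Beautiful Soup builds on, and logs the events it emits for the same markup. The names EventLogger and log_events are hypothetical, and the output can differ between Python versions.

```python
from html.parser import HTMLParser

HTML = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""


class EventLogger(HTMLParser):
    """Hypothetical helper: records the parse events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))

    def unknown_decl(self, data):
        # Called for declarations the parser does not recognise,
        # such as "<![endif]...>" marked sections.
        self.events.append(("unknown_decl", data))


def log_events(markup):
    """Feed the markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


events = log_events(HTML)
for kind, value in events:
    print(kind, repr(value))

# How many <a> start tags the parser actually reported
print(sum(1 for event in events if event == ("starttag", "a")))
```

If the anchors show up inside an unknown_decl event instead of as starttag events, that would match the empty list from find_all.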

I've read the documentation, but its explanation of the different parsers is pretty vague.

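I've also noticed that html5lib ignores invalid tags (e.g. nested form tags). Is there a way to use html5lib to avoid the html.parser behavior described above and still get invalid tags such as nested form tags? (When parsing with html5lib, one of the form tags is removed.)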

Thanks in advance.

Recommended answer

You can use lxml, which is very fast, and then use find_all or select to get all the tags.

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)

# Same document, queried with a CSS selector via select()
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
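
If you would rather stay on html.parser, one workaround (not part of the answer above, and only a sketch) is to strip the IE conditional-comment markers before parsing. The regular expression and the helper names strip_conditional_markers and count_anchors below are hypothetical. The example counts anchors with the standard-library HTMLParser so it runs without extra packages.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: the opening "<!--[if ...]>" and closing "<![endif]-->" markers.
# It does not cover every conditional-comment variant (for example "<!--<![endif]-->").
CONDITIONAL_MARKER = re.compile(r"<!--\[if[^\]]*\]>|<!\[endif\]-->", re.IGNORECASE)


def strip_conditional_markers(markup):
    """Remove the conditional-comment markers, keeping everything between them."""
    return CONDITIONAL_MARKER.sub("", markup)


class AnchorCounter(HTMLParser):
    """Counts <a> start tags seen by html.parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_anchors(markup):
    parser = AnchorCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


HTML = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

cleaned = strip_conditional_markers(HTML)
print(count_anchors(cleaned))
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser') in the usual way. Note that stripping the markers turns any markup that was hidden inside a conditional comment into live markup, so this only suits pages where that is acceptable.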
