Why does BeautifulSoup .children contain nameless elements as well as the expected tag(s)?

Question

#!/usr/bin/env python3
from bs4 import BeautifulSoup

test="""<!DOCTYPE html>
<html>
<head>
 <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
 <title>Test</title>
</head>
<body>
<table>
<tbody>
<tr>
 <td>
  <div>
   <b>
    Icon
   </b>
  </div>
 </td>
</tr>
</tbody>
</table>
</body>
</html>"""

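# Parse the `test` string defined above (the original referenced an undefined `test2`)
soup = BeautifulSoup(test, 'html.parser')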
rows = soup.findAll('tr')
for r in rows:
    print(r.name)
    for c in r.children:
        print('>', c.name)

Output

tr
> None
> td
> None

Why are there nameless elements in the list of the row's children?

This happens with 64-bit Python 3.3.1 on Windows 8, using html.parser (Python's built-in parser).

Recommended answer

The elements of .children can be NavigableStrings as well as Tags. In the case of your example, they're the whitespace before and after the td element.

This variation on your code hopefully makes it clear:

>>> rows = soup.findAll('tr')
>>> for r in rows:
...     print("row:", r.name)
...     for c in r.children:
...         print("---")
...         print(type(c))
...         print(repr(c))
... 
row: tr
---
<class 'bs4.element.NavigableString'>
'\n'
---
<class 'bs4.element.Tag'>
<td>
<div>
<b>
    Icon
   </b>
</div>
</td>
---
<class 'bs4.element.NavigableString'>
'\n'
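
If only the tag children are wanted, the whitespace strings can be filtered out. The following is a minimal sketch, not part of the original answer; the shortened `html` string and the variable names are made up for illustration. It shows two approaches that should give the same result: checking each child with `isinstance(c, Tag)`, and calling `find_all(True, recursive=False)`, which only considers direct children and only matches tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup with whitespace around the <td>,
# mirroring the structure in the question.
html = "<table><tr>\n <td>Icon</td>\n</tr></table>"
soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# Option 1: keep only Tag children, skipping whitespace NavigableStrings.
tag_children = [c for c in row.children if isinstance(c, Tag)]

# Option 2: match every tag (True), but only among direct children.
direct_tags = row.find_all(True, recursive=False)

print([c.name for c in tag_children])  # ['td']
print([c.name for c in direct_tags])   # ['td']
```

Which option reads better probably depends on context: the `isinstance` check keeps the loop over `.children` from the question intact, while `find_all` avoids the explicit filter.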
