Difference between .string and .text in BeautifulSoup

Question

I noticed something odd when working with BeautifulSoup and couldn't find any documentation that explains it, so I wanted to ask here.

Say we have tags like these that we have parsed with BS:

<td>Some Table Data</td>
<td></td>

The officially documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted.

However, I couldn't find any reference to this in the documentation and am worried that something is amiss. Can anyone tell me whether this is acceptable to use, or whether it will cause problems later?

BTW, I am scraping table data from a web page and mean to create CSVs from the data, so I actually do need empty strings rather than None.

Recommended answer

.string on a Tag object returns a NavigableString object. .text, on the other hand, gets all the child strings and returns them concatenated using the given separator. The return type of .text is a unicode object (str in Python 3).

From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.

From the documentation on .string, we can see that if the HTML looks like this,

<td>Some Table Data</td>
<td></td>

then .string on the second td will return None, but .text will return an empty string, which is a unicode object.
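The behaviour described above can be checked with a minimal sketch. It assumes the bs4 package is installed and uses Python's built-in "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The two cells from the question, parsed with the standard-library parser.
html = "<td>Some Table Data</td><td></td>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("td")

print(repr(first.string))   # the single string inside the first cell
print(repr(second.string))  # None, because the cell has no children
print(repr(second.text))    # an empty string, never None
```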

To summarize,

string

  • Convenience property of a tag to get the single string within this tag.
  • If the tag has a single string child, the return value is that string.
  • If the tag has no children or more than one child, the return value is None.
  • If the tag has exactly one child tag, the return value is the 'string' attribute of that child tag, recursively.

text

  • Gets all the child strings and returns them concatenated using the given separator.
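The "given separator" is easiest to see through get_text(), which is the method behind the .text property. The following is a small sketch, again assuming bs4 with the built-in "html.parser"; the cell content and variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A cell with two child strings: "even " and, inside <p>, "more text".
cell = BeautifulSoup("<td>even <p>more text</p></td>", "html.parser").td

default_text = cell.get_text()              # same as cell.text, empty separator
piped_text = cell.get_text(separator="|")   # separator placed between the strings
stripped_text = cell.get_text(" ", strip=True)  # strip each string, then join

print(repr(default_text))
print(repr(piped_text))
print(repr(stripped_text))
```

For scraped tables, the strip=True form is often the more convenient one, since it drops the whitespace around each piece before joining.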

If the HTML looks like this:

<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>

.string on the four td elements will return,

some text
None
more text
None

and .text will give this result,

some text

more text
even more text
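As a rough sketch of how this applies to the CSV use case in the question, the four cells can be collected with .text and written as one row. This assumes bs4 with the built-in "html.parser"; the surrounding <tr> and the in-memory buffer are only there to keep the example self-contained.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """<tr>
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
</tr>"""
cells = BeautifulSoup(html, "html.parser").find_all("td")

strings = [cell.string for cell in cells]  # contains None for cells 2 and 4
texts = [cell.text for cell in cells]      # always plain strings

# Write the .text values as a single CSV row into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerow(texts)
csv_line = buffer.getvalue()

print(strings)
print(texts)
print(repr(csv_line))
```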
