Difference between .string and .text BeautifulSoup
Question
I noticed something odd while working with BeautifulSoup and couldn't find any documentation to support it, so I wanted to ask here.
Say we have tags like these that we have parsed with BS:
<td>Some Table Data</td>
<td></td>
The official documented way to extract the data is soup.string. However, this extracted a NoneType for the second <td> tag. So I tried soup.text (because why not?) and it extracted an empty string, exactly as I wanted.
However, I couldn't find any reference to this in the documentation and am worried that something is amiss. Can anyone let me know if this is acceptable to use, or will it cause problems later?
BTW, I am scraping table data from a web page and mean to create CSVs from the data, so I do actually need empty strings rather than NoneTypes.
Recommended answer
.string on a Tag type object returns a NavigableString type object. On the other hand, .text gets all the child strings and returns them concatenated using the given separator (.text is shorthand for get_text() with the default empty separator). The return type of .text is a unicode object (a plain str in Python 3).
From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.
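To make the type difference concrete, here is a minimal sketch, assuming Python 3 with the bs4 package installed and the built-in html.parser backend. The sample markup and the "|" separator are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup("<td>even <p>more text</p></td>", "html.parser")
td = soup.td
p = td.p

# .string on a tag with a single string child is a NavigableString,
# which is still attached to the parse tree.
p_string = p.string

# .text is a plain string built from every descendant string.
td_text = td.text

# get_text() accepts the separator that .text leaves at its default "".
td_joined = td.get_text(separator="|")

print(type(p_string).__name__, isinstance(p_string, str))
print(p_string.parent.name)
print(repr(td_text))
print(repr(td_joined))
```

Because a NavigableString keeps a reference to its parent, wrapping it in str() is a common way to get a detached plain string when only the text is needed.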
From the documentation on .string, we can see that if the HTML is like this:
<td>Some Table Data</td>
<td></td>
Then .string on the second td will return None. But .text will return an empty string, which is a unicode type object (str in Python 3).
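The behaviour on the two cells from the question can be checked with a short sketch like the following. It assumes Python 3 and bs4 with the built-in html.parser; the surrounding table and tr tags are added here only to make the fragment well-formed.

```python
from bs4 import BeautifulSoup

# The two cells from the question, wrapped in a table row for well-formedness.
html = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td")

# .string is None for the empty cell because it has no children.
strings = [td.string for td in cells]

# .text always yields a string, so the empty cell becomes "".
texts = [td.text for td in cells]

print(strings)
print(texts)
```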
For more convenience, here is how the documentation describes the two properties.
string
Convenience property of a tag to get the single string within this tag.
If the tag has a single string child then the return value is that string.
If the tag has no children or more than one child then the return value is None.
If the tag has one child tag then the return value is the 'string' attribute of the child tag, recursively.
text
Get all the child strings and return them concatenated using the given separator.
If the HTML is like this:
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
.string on the four td elements will return:
some text
None
more text
None
.text will give a result like this (the second line is the empty string):
some text

more text
even more text
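The four-cell example can be reproduced, and tied back to the CSV use case from the question, with a sketch along these lines. It assumes Python 3 and bs4 with the built-in html.parser; the wrapping table row and the in-memory buffer are illustrative choices.

```python
import csv
import io

from bs4 import BeautifulSoup

# The four cells from the example above, wrapped in a single table row.
html = (
    "<table><tr>"
    "<td>some text</td>"
    "<td></td>"
    "<td><p>more text</p></td>"
    "<td>even <p>more text</p></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string: None for the empty cell and for the cell with two children.
strings = [td.string for td in cells]

# get_text() (what .text calls): always a string, never None.
texts = [td.get_text() for td in cells]

# Write the row to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer).writerow(texts)
csv_line = buffer.getvalue()

print(strings)
print(texts)
print(repr(csv_line))
```

Note that the fourth cell is where the two properties differ most for scraping: .string drops the content entirely, while .text keeps "even more text".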