Why does bs4 return tags and then an empty list to this find_all() method?

Problem description

Looking at the US Census QFD, I'm trying to grab the race percentages by county. The loop I'm building is outside the scope of my question, which concerns this code:

import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

# Last county in TX; for some reason QFD only numbers counties with odd numbers
url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0]  # c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1]  # s = state %

This grabs the HTML elements including their tags, not just the text within them:

c_black_alone, s_black_alone

(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
 <td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)

From the output above, I only want the percentages inside the elements.

Furthermore, why does

test_black = soup.find_all("td", text = "Black")

not return the same element as above (or its text), but instead return an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague...)

Solution

To get the text from those matches, use .text to get all contained text:

>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
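
As a minimal sketch of that extraction, the snippet below parses a small inline HTML fragment (assembled here from the output shown in the question, not fetched from the Census site) and pulls out just the percentages. The names SAMPLE_HTML, cells, percentages, and values are made up for illustration, and the built-in "html.parser" backend is an assumption chosen so the example needs no extra parser library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two cells copied from the question's output.
SAMPLE_HTML = """
<table>
  <tr>
    <td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>
    <td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "headers" is a multi-valued attribute on <td>, so 'rp9' matches "rp9 p1".
cells = soup.find_all("td", attrs={"headers": "rp9"})

# get_text(strip=True) returns the contained text with whitespace trimmed.
percentages = [cell.get_text(strip=True) for cell in cells]

# Optional: convert the strings to numbers for later calculations.
values = [float(p.rstrip("%")) for p in percentages]

print(percentages)
print(values)
```

Calling find_all() once and indexing into the result avoids searching the document twice, which is likely to matter once this runs inside a loop over many county pages.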

Your text search doesn't match anything for two reasons:

  1. A literal string only matches the whole contained text, not part of it. It will only work for an element such as <td>Black</td>, where the string is the sole content.
  2. The search uses the .string property, but that property is only set if the text is the only child of a given element. If other child nodes are present (other elements or, as here, a comment), .string is None and the search fails entirely.
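
Both reasons can be checked on a small fragment. This is a sketch under a few assumptions: the HTML is a hand-made sample (the cell with id "plain" is hypothetical, the "rp10" cell is taken from the page in question), the parser is the built-in "html.parser", and the keyword used is string=, which is the name Beautiful Soup 4.4.0 and later use for the older text= argument.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: one cell holding only "Black", one real label cell
# that contains text plus an HTML comment.
SAMPLE_HTML = (
    '<table><tr>'
    '<td id="plain">Black</td>'
    '<td id="rp10" valign="top">Black or African American alone, '
    'percent, 2013 (a)  <!-- RHI225213 --> </td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plain = soup.find("td", id="plain")
label = soup.find("td", id="rp10")

# .string is set only when the element has a single child.
print(repr(plain.string))   # the lone text node
print(repr(label.string))   # None: text + comment + text are three children

# Reason 1: a literal string must equal the whole .string value.
exact = [td["id"] for td in soup.find_all("td", string="Black")]

# Reason 2: even a partial (regex) match is tested against .string,
# so the cell with the comment is skipped because its .string is None.
regex_hits = [td["id"] for td in soup.find_all("td", string=re.compile("Black"))]

print(exact)
print(regex_hits)
```

In both searches only the hypothetical "plain" cell is found, even though "Black" clearly appears in the text of the other cell.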

The way around this is by using a lambda instead; it'll be passed the whole element and you can validate each element:

soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)

Demo:

>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td>]

Both of these matches have a comment in the <td> element, making a search with a text match ineffective.
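
If the lambda grows beyond a single condition, a named function can be passed to find_all() in the same way. The sketch below is an illustration only: SAMPLE_HTML is a hand-made fragment built from the two cells in the demo output plus one hypothetical non-matching row, is_black_label is an invented helper name, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the two label cells from the demo plus an unrelated row.
SAMPLE_HTML = """
<table>
  <tr><td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td></tr>
  <tr><td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td></tr>
  <tr><td id="other" valign="top">Some other row <!-- hypothetical --> </td></tr>
</table>
"""


def is_black_label(tag):
    """Return True for <td> elements whose text contains 'Black'."""
    # find_all() calls this for every tag, so filter on the name first.
    return tag.name == "td" and "Black" in tag.get_text()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.find_all(is_black_label)

# get_text() skips HTML comments by default, so only the visible label remains.
labels = [td.get_text(strip=True) for td in matches]

for td, label in zip(matches, labels):
    print(td["id"], "->", label)
```

Because the function receives the whole element, it can combine the text test with any other check (attributes, parents, siblings) without depending on .string.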
