Why does bs4 return tags and then an empty list to this find_all() method?
Problem description
Looking at the US Census QFD, I'm trying to grab the race percentages by county. The loop I'm building is outside the scope of my question, which concerns this code:
import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup

# Last county in TX; for some reason the QFD numbers counties with odd numbers only.
url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
c_black_alone = soup.find_all("td", attrs={'headers': 'rp9'})[0]  # c = county %
s_black_alone = soup.find_all("td", attrs={'headers': 'rp9'})[1]  # s = state %
This grabs the HTML elements including their tags, not just the text within them:
c_black_alone, s_black_alone
(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)
In the output above, I only want the percentages inside the elements.
Furthermore, why does
test_black = soup.find_all("td", text = "Black")
not return the same element as above (or its text), but instead return an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague.)
To get the text from those matches, use the .text attribute, which returns all of the text contained in the element:
>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
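For anyone who wants to try this without fetching the page, here is a minimal sketch in Python 3. It assumes an inline copy of the two cells shown in the question (the surrounding table markup is made up for the example), uses the built-in "html.parser" parser, and uses get_text(strip=True), which also trims stray whitespace.

```python
from bs4 import BeautifulSoup

# Inline copy of the two cells from the question, so no network access is needed.
# The <table>/<tr> wrapper is an assumption added for illustration.
html = """
<table><tr>
<td align="right" valign="bottom" headers="rp9 p1">96.9%<sup></sup></td>
<td align="right" valign="bottom" headers="rp9 p2">80.3%<sup></sup></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# 'headers' is a multi-valued attribute, so 'rp9' matches "rp9 p1" and "rp9 p2".
cells = soup.find_all("td", attrs={"headers": "rp9"})

# get_text(strip=True) returns only the text inside each cell, without the tags.
county_pct, state_pct = (cell.get_text(strip=True) for cell in cells)

# Optional: turn the percentage strings into numbers.
county_value = float(county_pct.rstrip("%"))
state_value = float(state_pct.rstrip("%"))

print(county_pct, state_pct)
```

Calling find_all() once and unpacking the result avoids searching the document twice for the same cells.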
Your text search doesn't match anything, for two reasons:
- A literal string only matches the whole contained text, not a part of it. It would only work for an element with <td>Black</td> as its sole contents.
- The search uses the .string property, but that property is only set if the text is the only child of a given element. If there are other child nodes present (such as the comments in these cells), the search will fail entirely.
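Both reasons can be checked on a small inline document. This is a sketch in Python 3, not the real page: the first cell (id "exact") is made up for the comparison, the second is copied from the census page. It uses string=, which is the newer name for the text= argument (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Two cells: a hypothetical one whose only content is "Black",
# and one copied from the census page that also contains a comment.
html = """
<table><tr>
<td id="exact">Black</td>
<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

exact = soup.find(id="exact")
commented = soup.find(id="rp10")

# .string is set only when the element has a single child string.
exact_string = exact.string          # the lone text node
commented_string = commented.string  # None: text, comment, and whitespace are three children

# A literal string must equal the element's .string exactly,
# so only the first cell can match.
matches = soup.find_all("td", string="Black")
matched_ids = [td["id"] for td in matches]

print(repr(exact_string), repr(commented_string), matched_ids)
```

On the real page no cell consists of the word "Black" alone, which is why the search in the question comes back empty.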
The way around this is to use a lambda instead; it is passed each whole element, so you can test each one yourself:
soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
Demo:
>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007 <!-- SBO315207 --> </td>]
Both of these matches have a comment in the <td> element, which makes a search with a text match ineffective.
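If the lambda gets reused, it may read better as a small named function. The sketch below (Python 3) runs the same test on an inline copy of the two matching cells plus one made-up non-matching cell; the helper name td_containing is hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup

# The two label cells from the demo output, plus a made-up cell that should not match.
html = """
<table>
<tr><td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td></tr>
<tr><td id="re6" valign="top">Black-owned firms, percent, 2007 <!-- SBO315207 --> </td></tr>
<tr><td id="other" valign="top">Some other row label</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def td_containing(needle):
    """Hypothetical helper: build a find_all() filter for <td> cells whose text contains needle."""
    def matcher(element):
        # find_all() calls this with each tag; .text joins all text inside it
        # (comments are left out), so a partial match works here.
        return element.name == "td" and needle in element.text
    return matcher


black_cells = soup.find_all(td_containing("Black"))
black_ids = [td["id"] for td in black_cells]

print(black_ids)
```

The filter does a plain substring test, so it is case-sensitive; swap in a regular expression or lower-case both sides if that matters for your data.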