Using Beautiful Soup to find a specific class

Question

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I fetch the web page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I try the find_all() function, I do not get any results:

results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the bits I want, e.g.:

<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
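
For reference, the reduced case can be reproduced offline. This is a minimal sketch, assuming the built-in "html.parser" instead of the "html5lib" parser used in the question, so that only bs4 needs to be installed. It uses the class_ keyword, which behaves like the attrs form in the question for this search.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>
"""

# Assumption: "html.parser" (bundled with Python) rather than "html5lib".
soup = BeautifulSoup(html, "html.parser")

# class is a multi-valued attribute, so a div matches when any one of
# its classes equals "home-summary-row".
results = soup.find_all("div", class_="home-summary-row")

print(len(results))
for div in results:
    print(div["class"], repr(div.get_text(strip=True)))
```

Both divs match here, which suggests the search expression itself is fine and the difference lies in the HTML that is actually downloaded and parsed.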


Working example:

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)

Solution

According to the W3.org Validator, the HTML has a number of issues, such as stray closing tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.

You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:

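from bs4 import BeautifulSoup  # Beautiful Soup 4, as used in the question

# bad_html is assumed to hold the raw page source (str or bytes)
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()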
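
The manual cleanup mentioned above (removing line breaks and trailing spaces from the end of each line) could look roughly like the sketch below. The function name clean_html and the sample string are hypothetical, made up for illustration. Joining lines this way also changes whitespace inside elements such as <pre> and <script>, so it is only a rough pre-processing step, not a real HTML normalizer.

```python
def clean_html(raw_html: str) -> str:
    """Strip trailing whitespace from each line, drop empty lines,
    and join what remains with single spaces."""
    lines = (line.rstrip() for line in raw_html.splitlines())
    return " ".join(line for line in lines if line)


# Hypothetical sample modeled on the split <a> tag quoted above.
bad_html = (
    '<a \n'
    'href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" >\n'
    'Recent home sales</a>   \n'
)

print(clean_html(bad_html))
```

The cleaned string can then be passed to BeautifulSoup in place of the raw response body.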
