How to scrape dynamic webpages with Python

Views: 183 Published: 2016/8/5 19:00:15 Tags: python html web-scraping beautifulsoup scrape


Question

Hello everyone.

I'd like to scrape the webpage below for used-car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

I want to scrape all of the result pages. At the URL above, only the first 30 items are shown, and I can already scrape those with the code I wrote below. Links to the other pages are displayed as 1 2 3 ..., but the link addresses seem to be generated in JavaScript. I searched Google for useful information but couldn't find any.

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    #title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    #price of car itself
    print(soup.find(class_='price1').string)
    #price of car including tax
    print(soup.find(class_='price2').string)

    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)

[What I'd like to know]

How to scrape all of the result pages. I'd prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please point me to others.
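One common way to approach this kind of paging, sketched here under stated assumptions: if the JavaScript links turn out to request the same search URL with an extra page parameter, the page URLs can be generated directly and fed to the existing BeautifulSoup code. The parameter name `page` below is hypothetical (the real one would have to be read from the site's JavaScript or the browser's network tab), and the sketch assumes the total count scraped into `total_cars` is a number that may contain thousands separators. The 30-items-per-page figure comes from the question itself.

```python
import math
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ITEMS_PER_PAGE = 30   # from the question: only 30 items are shown per page
PAGE_PARAM = "page"   # hypothetical parameter name, not confirmed for this site


def page_count(total_items_text, per_page=ITEMS_PER_PAGE):
    """Turn a scraped total such as '1,234' into the number of result pages."""
    total = int(total_items_text.replace(",", "").strip())
    return math.ceil(total / per_page)


def page_urls(base_url, total_items_text):
    """Build one URL per result page by appending the assumed page parameter."""
    parts = urlsplit(base_url)
    # keep_blank_values preserves empty parameters such as 'price_range='
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(1, page_count(total_items_text) + 1):
        # safe="," keeps the comma-separated prefecture codes readable
        page_query = urlencode(query + [(PAGE_PARAM, page)], safe=",")
        urls.append(urlunsplit(parts._replace(query=page_query)))
    return urls
```

Each generated URL could then be passed to `urllib.request.urlopen` in place of the single URL used above. If the page parameter cannot be found, or the listing is rendered only by JavaScript, a browser-automation tool such as Selenium is the usual alternative, with BeautifulSoup still parsing the resulting HTML.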

Windows 8.1

Python 3.5

PyDev (Eclipse)

BeautifulSoup4

Any guidance would be appreciated. Thank you.
