How to scrape dynamic webpages with Python

Question

Hello everyone.

I want to scrape the webpage below for used car data:
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

My goal is to scrape every page of results. At the URL above, only the first 30 items are shown. I can scrape those with the code below, which I wrote. Links to the other pages are displayed as 1 2 3 ..., but the link addresses seem to be generated in JavaScript. I searched Google for useful information but couldn't find any.

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    #title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    #price of car itself
    print(soup.find(class_='price1').string)
    #price of car including tax
    print(soup.find(class_='price2').string)

    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)

[What I'd like to know]

How can I scrape all of the result pages? I would prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please suggest another one.


  • Windows 8.1

  • Python 3.5

  • PyDev (Eclipse)

  • BeautifulSoup4

Any guidance would be appreciated. Thank you.

Recommended answer

You can use Selenium, as in the sample below:

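from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com')
# Selenium 4 removed find_element_by_class_name; use find_element with a By locator
# (you can also locate by link text, CSS selector, XPath, etc.)
element = driver.find_element(By.CLASS_NAME, "yourClassName")
element.click()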
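
The sample above clicks a single element. To cover every results page, one possible approach is to let Selenium click the numbered pager links and hand each rendered page to BeautifulSoup, reusing the selectors from the question. The following is only a sketch under stated assumptions: the helper names (`extract_car_urls`, `crawl_pages`, `run_with_firefox`) are hypothetical, the pager links are assumed to have the page number ("2", "3", ...) as their exact visible text, and the fixed pause is a crude stand-in for a proper wait. None of this has been checked against the live site.

```python
import time
from urllib.parse import urljoin

BASE_URL = "http://www.goo-net.com"  # site root, as used in the question's code


def extract_car_urls(html, base_url=BASE_URL):
    """Return absolute detail-page URLs found in one results page."""
    # bs4 is imported here so that crawl_pages() below has no third-party dependency.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    urls = []
    # Same structure the question relies on: .heading_inner > h4 > a
    for heading in soup.find_all(class_="heading_inner"):
        link = heading.select_one("h4 a")
        if link is not None and link.get("href"):
            urls.append(urljoin(base_url, link["href"]))
    return urls


def crawl_pages(driver, parse_page, max_pages=5, pause=1.0):
    """Parse page 1, then click pager links "2", "3", ... and parse each page."""
    results = []
    for page_no in range(1, max_pages + 1):
        if page_no > 1:
            # "link text" is the string value of selenium's By.LINK_TEXT.
            # ASSUMPTION: the pager link's visible text is exactly the page number.
            links = driver.find_elements("link text", str(page_no))
            if not links:
                break  # no link for this page number: assume the last page was reached
            links[0].click()
            time.sleep(pause)  # crude wait for the new page to render
        results.extend(parse_page(driver.page_source))
    return results


def run_with_firefox(start_url, max_pages=5):
    """Open Firefox, crawl up to max_pages result pages, and always close the browser."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        return crawl_pages(driver, extract_car_urls, max_pages=max_pages)
    finally:
        driver.quit()
```

Calling `run_with_firefox` with the search URL from the question should return a list of detail-page URLs, which can then be fetched as in the question's second loop. If the site loads results asynchronously, Selenium's `WebDriverWait` with an expected condition is likely to be more reliable than the fixed `pause`.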
