How to scrape dynamic webpages with Python

Problem description

Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

I want to scrape all of the pages. At the URL above, only the first 30 items are shown; those can be scraped with the code below, which I wrote. Links to the other pages are displayed as 1 2 3..., but the link addresses seem to be generated by JavaScript. I googled for useful information but couldn't find any.

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

soup = BeautifulSoup(html, "lxml")
# total number of cars and the index range shown on the current page
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

# visit each car detail page and print its details
for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    #title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    #price of car itself
    print(soup.find(class_='price1').string)
    #price of car including tax
    print(soup.find(class_='price2').string)

    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)

[What I'd like to know]

How to scrape all of the pages. I prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me other ones.

  • Windows 8.1
  • Python 3.5
  • PyDev (Eclipse)
  • BeautifulSoup4

Any guidance would be appreciated. Thank you.

Recommended answer

You can use selenium, as in the sample below:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com')
element = driver.find_element_by_class_name("yourClassName")  # or locate the element by link text, XPath, etc.
element.click()
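
Once Selenium has rendered a page, you can hand driver.page_source back to BeautifulSoup and keep the parsing code from the question. The sketch below is a minimal illustration of that pattern, assuming Firefox, the old find_element_by_* API used in the sample above, and a placeholder link text for the "next page" control (the real locator on goo-net has to be checked in the page source).

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

while True:
    # parse the page that Selenium has already rendered
    soup = BeautifulSoup(driver.page_source, "lxml")
    for heading_inner in soup.find_all(class_="heading_inner"):
        href = heading_inner.find('h4').find('a').get('href')
        print('http://www.goo-net.com' + href)  # reuse the per-car scraping from the question here

    # try to follow the JavaScript-driven pagination link
    try:
        # "NEXT_LINK_TEXT" is a placeholder; inspect the pagination links for the real text or class
        next_link = driver.find_element_by_link_text("NEXT_LINK_TEXT")
    except NoSuchElementException:
        break  # no more pages
    next_link.click()
    time.sleep(2)  # crude wait for the next page to load

driver.quit()

Using time.sleep keeps the sketch short; in practice selenium.webdriver.support.ui.WebDriverWait is the more reliable way to wait for the new results to appear before parsing.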
