Scraping the second page of a website in Python does not work
Question
Let's say I want to scrape the data here.
I can do it nicely using urlopen and BeautifulSoup in Python 2.7.
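For context, here is a minimal, self-contained sketch of the kind of first-page scrape the question describes. It is written in Python 3 and uses the standard library's html.parser in place of BeautifulSoup so it runs with no third-party packages; the TitleExtractor class and the sample HTML snippet are illustrative inventions, not part of the original question.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of every <a> found inside a div with class zg_title,
    mirroring the CSS selector used later in the answer's code."""
    def __init__(self):
        super().__init__()
        self.in_title_div = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "zg_title" in attrs.get("class", ""):
            self.in_title_div = True
        elif tag == "a" and self.in_title_div:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_title_div = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles.append(data.strip())

# Hypothetical fragment of a best-sellers listing page:
html = '<div class="zg_title"><a href="/b1">Book One</a></div>'
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['Book One']
```

In a real run you would feed the parser the HTML fetched with urlopen instead of an inline string.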
Now, what if I want the data from this address?
What I get is the data from the first page! I looked at the page source of the second page using Chrome's "view page source", and the content belongs to the first page!
How can I scrape the data from the second page?
Answer
The page is quite asynchronous in nature: the search results are loaded by XHR requests. Simulate those requests in your code using requests. Sample code as a starting point for you:
from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/#2'
ajax_url = "http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"

def get_books(data):
    soup = BeautifulSoup(data)
    for title in soup.select("div.zg_itemImmersion div.zg_title a"):
        print title.get_text(strip=True)

with requests.Session() as session:
    # Hit the listing page first so the session picks up any cookies
    session.get(url)
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'X-Requested-With': 'XMLHttpRequest'
    }

    for page in range(1, 10):
        print "Page #%d" % page

        # First XHR: the "above the fold" half of the results
        params = {
            "_encoding": "UTF8",
            "pg": str(page),
            "ajax": "1"
        }
        response = session.get(ajax_url, params=params)
        get_books(response.content)

        # Second XHR: the remainder of the page
        params["isAboveTheFold"] = "0"
        response = session.get(ajax_url, params=params)
        get_books(response.content)
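To make the pagination scheme above concrete, here is a sketch (in Python 3, unlike the Python 2 answer code) of how the two AJAX URLs for each page are composed. The page_url helper is a hypothetical name introduced for illustration; the base URL and parameter names come from the answer.

```python
from urllib.parse import urlencode

ajax_url = ("http://www.amazon.com/Best-Sellers-Books-Architecture"
            "/zgbs/books/173508/ref=zg_bs_173508_pg_2")

def page_url(page, above_the_fold=True):
    """Builds the query string for one of the two XHR requests per page."""
    params = {"_encoding": "UTF8", "pg": str(page), "ajax": "1"}
    if not above_the_fold:
        # The second request per page asks for the below-the-fold items
        params["isAboveTheFold"] = "0"
    return ajax_url + "?" + urlencode(params)

print(page_url(2))
print(page_url(2, above_the_fold=False))
```

Each results page therefore requires two GET requests, differing only in the isAboveTheFold parameter.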
Don't forget to be a good web-scraping citizen and follow the Terms of Use.