Scrape hidden pages if search yields more results than displayed

Problem description

Some of the search queries entered under https://www.comparis.ch/carfinder/default yield more than 1'000 results (the count is shown dynamically on the search page). However, the results view is capped at 100 pages of 10 results each, so for a query that yields more than 1'000 results I am trying to scrape the remaining data. The code to scrape the IDs from the first 100 pages is below (it takes approx. 2 minutes to run through all 100 pages):

from bs4 import BeautifulSoup
import requests

# the site shows at most 100 result pages
number_of_pages = 100

# car ID -> details dict (to be filled in later)
car_dict = {}

# parse every search results page and extract every car ID
base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
for page in range(0, number_of_pages + 1):
    response = requests.get(base_url + '?page=' + str(page))
    soup = BeautifulSoup(response.content, "lxml")

    # every result is an <h2> containing a link that ends in the car ID
    for car in soup.find('div', {'id': 'cf-result-list'}).find_all('h2'):
        car_id = int(car.find('a')['href'].split('/')[-1])
        car_dict[car_id] = {}

So I obviously tried just passing a str(page) greater than 100, which does not yield additional results. How could I access the remaining results, if at all?

Answer

It seems that your website loads data dynamically while the client is browsing. There are probably a number of ways to work around this; one option could be to utilize Scrapy Splash.

Assuming you use scrapy, you can do the following:

  1. Start a Splash server using Docker (typically docker run -p 8050:8050 scrapinghub/splash) and make a note of the server's address.
  2. In settings.py add SPLASH_URL = <splash-server-ip-address>
  3. In settings.py add the following to the downloader middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
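
For completeness: the scrapy-splash README also recommends a spider middleware and a Splash-aware dupe filter alongside the downloader middlewares above (this is version-dependent, so treat it as a hint rather than a requirement):

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# lets scrapy's duplicate filter compare Splash requests correctly
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'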

  4. In your spider.py, add from scrapy_splash import SplashRequest
  5. Set start_urls in your spider.py to iterate over the pages

For example, like this:

start_urls = [
    'https://www.comparis.ch/carfinder/marktplatz/occasion?page=' + str(page)
    for page in range(0, 100)
]

  6. Redirect the URL to the Splash server by modifying def start_requests(self):

For example, like this:

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
            endpoint='render.html',
            args={'wait': 0.5},
        )

  7. Parse the response as you do now; a minimal consolidated sketch follows below.
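
Putting steps 4-7 together, a minimal spider sketch might look like this (the spider name is hypothetical, and the CSS selector simply mirrors the BeautifulSoup extraction from the question):

import scrapy
from scrapy_splash import SplashRequest


class CarfinderSpider(scrapy.Spider):
    # hypothetical spider name, chosen for illustration
    name = 'carfinder'

    start_urls = [
        'https://www.comparis.ch/carfinder/marktplatz/occasion?page=' + str(page)
        for page in range(0, 100)
    ]

    def start_requests(self):
        # route every page through the Splash server so the
        # JavaScript-rendered result list is present in the response
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # same extraction as in the question: every result is an <h2>
        # whose link ends in the numeric car ID
        for href in response.css('#cf-result-list h2 a::attr(href)').getall():
            yield {'car_id': int(href.split('/')[-1])}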

Let me know how that works out for you.
