Update Parameter for Web Scraping With Infinite Scroll

Problem Description

I am unsure how I should structure my code here so that the offset parameter updates each time the function recursively calls itself. I feel like there is some easy fix that I'm missing. More detail about my script and the challenge I'm trying to solve is in this related question: Scraping Website With Infinite Scroll Using Scrapy.

import scrapy
import json
import requests

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']

    def parse(self, response):
        data = json.loads(response.text)
        for used_item in data:
            if len(data) == 0:
                break
            try:
                title = used_item['name']
                price = used_item['price']
                description = used_item['description']
                date = used_item['updated_at']
                images = [img['url'] for img in used_item['images']]
                latitude = used_item['geo']['lat']
                longitude = used_item['geo']['lng']               
            except Exception:
                pass

        yield {'Title': title,
               'Price': price,
               'Description': description,
               'Date': date,
               'Images': images,
               'Latitude': latitude,
               'Longitude': longitude          
               }    

        # attempt to paginate: build a new request with the offset increased by 50
        i = 0
        for new_items_load in response:
            i += 50
            offset = i
            new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(i) + \
                          '&quadkey=0320030123201&num_results=50&distance_type=mi'
            yield scrapy.Request(new_request, callback=self.parse)

Recommended Answer

Define offset as a class attribute:

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']
    offset = 0  # <- here

Then you can refer to it using self.offset, and the value will be shared across all invocations of parse. So it would be something like this:

self.offset += 50
new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(self.offset) + \
                      '&quadkey=0320030123201&num_results=50&distance_type=mi'
yield scrapy.Request(new_request, callback=self.parse)
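
Putting the two pieces together, here is a minimal sketch of how the full spider could look. Treat it as an illustration rather than a verbatim solution: the stop-when-empty check and the field names come from the question's code, only a few of the yielded fields are shown, and allowed_domains is assumed to be the bare letgo.com domain so the paginated API requests are not filtered as off-site.

import json
import scrapy

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    # bare domain, so requests to search-products-pwa.letgo.com are not dropped as off-site
    allowed_domains = ['letgo.com']
    base_url = ('https://search-products-pwa.letgo.com/api/products?country_code=US'
                '&offset={}&quadkey=0320030123201&num_results=50&distance_type=mi')
    start_urls = [base_url.format(0)]
    offset = 0  # class attribute, shared by every parse() call

    def parse(self, response):
        data = json.loads(response.text)
        if not data:
            # an empty result page means there is nothing left to paginate
            return
        for used_item in data:
            yield {
                'Title': used_item.get('name'),
                'Price': used_item.get('price'),
                'Description': used_item.get('description'),
                # ... remaining fields from the question omitted for brevity
            }
        # advance the shared offset and queue the next page
        self.offset += 50
        yield scrapy.Request(self.base_url.format(self.offset), callback=self.parse)

This works because each new request is only scheduled from the response of the previous page, so the shared offset is incremented exactly once per page. If you ever issue several page requests in parallel, passing the offset along through the request's meta or cb_kwargs is the safer pattern.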
