Scraping Website With Infinite Scroll Using Scrapy


Problem Description



I am trying to scrape used items in my area from https://us.letgo.com/en for a personal project. I found this video helpful https://youtu.be/EelmnSzykyI. However, there are some subtle differences that the video doesn't help with.

The info I need is loaded asynchronously via JSON. The website loads 15 items per scroll (except for the initial load, which contains 30 items). The JSON object looks like this: https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=30&distance_type=mi and the next 15 items to load look like this: https://search-products-pwa.letgo.com/api/products?country_code=US&offset=30&quadkey=0320030123201&num_results=15&distance_type=mi
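
The pagination pattern in those two URLs can be sketched as a small URL generator. This is my own illustration (the `BASE` template and `page_urls` helper are not from the question), assuming the `quadkey` stays fixed for one location:

```python
# Sketch of how the paginated API URLs advance: the first page uses
# offset=0&num_results=30, and each later scroll advances the offset
# by 15 with num_results=15.
BASE = ('https://search-products-pwa.letgo.com/api/products'
        '?country_code=US&offset={offset}'
        '&quadkey=0320030123201&num_results={num}&distance_type=mi')

def page_urls(pages):
    """Yield the first `pages` API URLs in scroll order."""
    yield BASE.format(offset=0, num=30)           # initial load: 30 items
    for offset in range(30, 30 + 15 * (pages - 1), 15):
        yield BASE.format(offset=offset, num=15)  # each later scroll: 15 items

urls = list(page_urls(3))
```

Generating the URLs up front like this is one option; the alternative, shown in the question below, is to derive each next URL from the previous response.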

When I load the first response data = json.loads(response.text) it returns a list of the 30 items. The first item looks like this:

{'attributes': None,
 'category_id': 5,
 'created_at': '2018-02-12T15:40:56+00:00',
 'currency': 'USD',
 'description': None,
 'featured': False,
 'geo': {'city': 'Asheville',
  'country_code': 'US',
  'distance': 1.1703344331099,
  'lat': 35.5889898,
  'lng': -82.5308015,
  'zip_code': '28805'},
 'id': '6080a1db-b2af-44c2-bfd8-4cc7f1ded17f',
 'image_information': 'brown ukulele',
 'images': [{'id': 'b8e78e2e-65c4-4062-b89e-c775ef9f6bc9',
   'url': 'https://img.letgo.com/images/ab/da/e2/f6/abdae2f68e34170d8f1f22d2473d1153.jpeg'}],
 'language_code': 'US',
 'name': None,
 'owner': {'avatar_url': '',
  'banned': False,
  'city': 'Asheville',
  'country_code': 'US',
  'id': 'fb0f8657-0273-4fac-ba77-9965a1dc8794',
  'is_richy': False,
  'name': 'Brock G',
  'status': 'active',
  'zip_code': '28805'},
 'price': 100,
 'price_flag': 2,
 'rejected': False,
 'status': 1,
 'thumb': {'height': 1280,
  'url': 'https://img.letgo.com/images/ab/da/e2/f6/abdae2f68e34170d8f1f22d2473d1153.jpeg?impolicy=img_200',
  'width': 960},
 'updated_at': '2018-02-12T15:41:34+00:00'}
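
As a quick illustration of pulling fields out of one such item dict, here is a hypothetical helper (the `parse_item` name is mine, not from the question), fed a trimmed-down version of the example above:

```python
def parse_item(used_item):
    """Extract the fields of interest from one item dict of the API response."""
    return {
        'Title': used_item['name'],
        'Price': used_item['price'],
        'Description': used_item['description'],
        'Date': used_item['updated_at'],
        'Images': [img['url'] for img in used_item['images']],
        'Latitude': used_item['geo']['lat'],
        'Longitude': used_item['geo']['lng'],
    }

item = parse_item({
    'name': None, 'price': 100, 'description': None,
    'updated_at': '2018-02-12T15:41:34+00:00',
    'images': [{'url': 'https://img.letgo.com/images/ab/da/e2/f6/abdae2f68e34170d8f1f22d2473d1153.jpeg'}],
    'geo': {'lat': 35.5889898, 'lng': -82.5308015},
})
```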

My aim is to create a for loop and extract out each item and then move on to the next request that loads in an additional 15 items, but I'm not sure how to do this. Please note that each additional request increases offset by 15 and uses num_results=15, as in the second URL above.

Update

I am getting close but can't seem to figure out how I can simply update the offset parameter within the function that is essentially recursively calling itself:

import json

import scrapy


class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']

    def parse(self, response):
        data = json.loads(response.text)
        for used_item in data:
            if len(data) == 0:
                break
            try:
                title = used_item['name']
                price = used_item['price']
                description = used_item['description']
                date = used_item['updated_at']
                images = [img['url'] for img in used_item['images']]
                latitude = used_item['geo']['lat']
                longitude = used_item['geo']['lng']               
            except Exception:
                pass

        yield {'Title': title,
               'Price': price,
               'Description': description,
               'Date': date,
               'Images': images,
               'Latitude': latitude,
               'Longitude': longitude          
               }    

        i = 0
        for new_items_load in response:
            i += 50 
            offset = i
            new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(i) + \
                          '&quadkey=0320030123201&num_results=50&distance_type=mi'
            yield scrapy.Request(new_request, callback=self.parse)

Solution

Not sure if I understand your question well.

If you just need to know how to define the parameters, this could be a way:

  let offset, num_results;
  for(let i = 0; i < max; i += 15) {
    offset = i;
    num_results = i + 15;
    [do the request with the parameter values]
  }
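
Translating that idea back into the question's Python/Scrapy setting: instead of looping inside parse(), each parse() call can schedule exactly one follow-up request whose offset has been advanced by the previous num_results. The `next_url` helper below is my own sketch (it is not part of Scrapy or the question) so the pagination arithmetic can be shown on its own:

```python
from urllib.parse import parse_qs, urlencode, urlparse

def next_url(url, step=15):
    """Build the URL for the next scroll: offset += current num_results, num_results = step."""
    parts = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    params['offset'] = str(int(params['offset']) + int(params['num_results']))
    params['num_results'] = str(step)
    return parts._replace(query=urlencode(params)).geturl()

url = ('https://search-products-pwa.letgo.com/api/products'
       '?country_code=US&offset=0&quadkey=0320030123201'
       '&num_results=30&distance_type=mi')
```

Inside the spider, parse() would then end with something like `yield scrapy.Request(next_url(response.url), callback=self.parse)`, and stop yielding the follow-up once the response list comes back empty.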
