How to get a single item across many sites in scrapy?


Problem description

Here is my situation:

I want to crawl product details from a specific product detail page which describes the product (Page A). This page contains a link to a page that lists the sellers of this product (Page B), and each seller entry links to another page (Page C) which contains that seller's details. Here is an example schema:

Page A:

  • product_name
  • link to the sellers of this product (Page B)

Page B:

  • list of sellers, each containing:
    • seller_name
    • seller_price
    • link to the seller details page (Page C)

Page C:

  • seller_address

This is the JSON I want to obtain after crawling:

    {
      "product_name": "product1",
      "sellers": [
        {
          "seller_name": "seller1",
          "seller_price": 100,
          "seller_address": "address1",
        },
        (...)
      ]
    }
    

What I have tried: passing the product information from the first parse method to the second parse method in the meta object. This works fine across 2 levels, but I have 3 levels, and I want a single item.

Is this possible in scrapy?

As requested, here is a minified example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return only one composed object:

    import scrapy
    
    class ExampleSpider(scrapy.Spider):
        name = 'examplespider'
        allowed_domains = ["example.com"]
    
        start_urls = [
            'http://example.com/products/product1'
        ]
    
        def parse(self, response):
    
            # assume this object was obtained after
            # some xpath processing
            product_name = 'product1'
            link_to_sellers = 'http://example.com/products/product1/sellers'
    
            yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
                'product': {
                    'product_name': product_name,
                    'sellers': []
                }
            })
    
        def parse_sellers(self, response):
            product = response.meta['product']
    
            # assume this object was obtained after
            # some xpath processing
            sellers = [
                {
                    'seller_name': 'seller1',
                    'seller_price': 100,
                    'seller_detail_url': 'http://example.com/sellers/seller1',
                },
                {
                    'seller_name': 'seller2',
                    'seller_price': 100,
                    'seller_detail_url': 'http://example.com/sellers/seller2',
                },
                {
                    'seller_name': 'seller3',
                    'seller_price': 100,
                    'seller_detail_url': 'http://example.com/sellers/seller3',
                },
            ]
    
            for seller in sellers:
                product['sellers'].append(seller)
                yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})
    
        def parse_seller(self, response):
            seller = response.meta['seller']
    
            # assume this object was obtained after
            # some xpath processing
            seller_address = 'seller_address1'
    
            seller['seller_address'] = seller_address
    
            yield seller
    

Recommended answer

You need to change your logic a bit, so that it queries only one seller address at a time; once that request completes, it queries the next seller.

    def parse_sellers(self, response):
        meta = response.meta
    
        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            },
        ]
    
        if sellers:
            current_seller = sellers.pop()
            meta['pending_sellers'] = sellers
            meta['current_seller'] = current_seller
            yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
        else:
            # no sellers at all: the product is already complete
            yield meta['product']
    
    
        # for seller in sellers:
        #     product['sellers'].append(seller)
        #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})
    
    def parse_seller(self, response):
        meta = response.meta
        current_seller = meta['current_seller']
        sellers = meta['pending_sellers']
        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'
    
        current_seller['seller_address'] = seller_address
    
        meta['product']['sellers'].append(current_seller)
        if sellers:
            current_seller = sellers.pop()
            meta['pending_sellers'] = sellers
            meta['current_seller'] = current_seller
    
            yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
        else:
            yield meta['product']
    

But this is still not a great approach, because a seller may be selling multiple items. When you reach another item sold by the same seller, your request for the seller's address will be rejected by the dupe filter. You can fix that by adding dont_filter=True to the request, but that would mean too many unnecessary hits to the website.
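For illustration, a minimal sketch of where that flag would go, reusing the names from the code above; dont_filter is a standard parameter of scrapy.Request, but whether you want it here depends on the duplicate-traffic trade-off just described:

    yield scrapy.Request(
        current_seller['seller_detail_url'],
        callback=self.parse_seller,
        meta=meta,
        # dont_filter=True tells Scrapy's scheduler to skip the duplicate
        # filter for this request, so the same seller page can be fetched
        # again for every product that lists this seller.
        dont_filter=True,
    )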

So you need to add DB handling directly in your code to check whether you already have a seller's details: if yes, use them; if not, fetch the details.
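A minimal sketch of that idea, using an in-memory dict as a stand-in for a real database; the seller_cache attribute and the seller_request helper are illustrative names, not part of the original answer:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'examplespider'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Stand-in for a real DB lookup: seller_detail_url -> seller_address.
            self.seller_cache = {}

        def seller_request(self, seller, meta):
            """Return a Request for the seller page, or None if already cached."""
            url = seller['seller_detail_url']
            if url in self.seller_cache:
                # Seller seen before: reuse the stored address, no extra HTTP hit.
                seller['seller_address'] = self.seller_cache[url]
                return None
            meta['current_seller'] = seller
            return scrapy.Request(url, callback=self.parse_seller, meta=meta)

        def parse_seller(self, response):
            seller = response.meta['current_seller']
            # assume this was obtained after some xpath processing
            seller_address = 'seller_address1'
            self.seller_cache[response.url] = seller_address  # remember for reuse
            seller['seller_address'] = seller_address
            yield seller

With a real database, the dict lookup and assignment would become SELECT and INSERT calls, but the control flow stays the same.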
