Use Scrapy to get a list of URLs, and then scrape content inside those URLs
Problem Description
I need a Scrapy spider to crawl the following page (https://www.phidgets.com/?tier=1&catid=64&pcid=57) for the URL of each product (30 products, so 30 URLs), and then follow each of those URLs and scrape the data inside.
I have the second part working exactly as I want:
import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }

        # next_page = response.css('div.ph-summary-entry-ctn a::attr("href")').extract_first()
        # if next_page is not None:
        #     yield response.follow(next_page, self.parse)
But I don't know how to do the first part. As you can see, I have the main page (https://www.phidgets.com/?tier=1&catid=64&pcid=57) set as the start_url. But how do I get it to populate the start_urls list with all 30 URLs I need crawled?
Recommended Answer
I am not able to test at this moment, so please let me know if this works for you so I can edit it should there be any bugs.
The idea here is that we find every product link on the first page and yield new Scrapy requests, passing your product-parsing method as the callback:
import scrapy
from urllib.parse import urljoin


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        # Grab the href of every product link on the listing page
        products = response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract()
        for p in products:
            # The hrefs may be relative, so resolve them against the page URL
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }
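A note beyond the original answer: on Scrapy 1.4 or newer, response.follow can replace the urljoin/scrapy.Request pair, since it resolves relative hrefs against the current page's URL itself. A minimal sketch of what the parse method above would look like under that assumption:

    def parse(self, response):
        # response.follow accepts a relative href and resolves it
        # against response.url, so urljoin is not needed
        for href in response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract():
            yield response.follow(href, callback=self.parse_product)

Either way, you can test the spider and inspect the items by exporting them to a file, e.g. scrapy runspider products_spider.py -o products.json (the filename here is illustrative).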