Scrapy: Following pagination link to scrape data


Problem description

I am trying to scrape data from a page and continue scraping following the pagination link.

The page I am trying to scrape is --> here

# -*- coding: utf-8 -*-
import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

def parse(self, response):
    for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
        item = {
            'product_name': products.xpath('.//h2/a/@title').extract_first(),
            'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
            'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
            'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
            'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
            'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
            #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
         }
        yield item

    # Follow the pagination link
    next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(url=next_page_url, callback=self.parse)

Problem

  • The code fails to follow the pagination link.
  • Modify the code so that it follows the pagination link.

Answer

To get your code working, you need to fix the broken link by using response.follow() or something similar. Try the approach below.

import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
            item = {
                'product_name': products.xpath('.//h2/a/@title').extract_first(),
                'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
            }
            yield item

        # Follow the pagination link. response.follow() resolves a relative
        # href against response.url, unlike scrapy.Request, which requires
        # an absolute URL.
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

Your pasted code was badly indented. I've fixed that as well.
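
For completeness, the original scrapy.Request call also works if the extracted href is made absolute first. The sketch below (assuming the same spider as above) uses response.urljoin(), a standard method on Scrapy's Response, to resolve the relative link:

import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        # ... same item extraction as in the answer above ...

        # response.urljoin() resolves a relative or protocol-relative href
        # against the URL of the current response, producing the absolute
        # URL that scrapy.Request requires.
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

Either way, the key point is the same: the href taken from <link rel="next"> is not an absolute URL, so it must be resolved against the current page before a new request can be made. response.follow() simply does that resolution for you.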

