Scraping many pages using scrapy

Problem Description

I am trying to scrape multiple webpages using scrapy. The links of the pages look like:

http://www.example.com/id=some-number

On the next page, the number at the end is reduced by 1.

So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:

import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range (starting_number, number_of_pages, -1):
            yield Request(url = URL % i, callback = self.parse)

    def parse(self, response):
        # ... parse data from the webpage ...
        pass

This is running into an infinite loop, and when printing the page number I am getting negative numbers. I think that is happening because I am requesting a page from within my parse() function.

But then the example given here works okay. Where am I going wrong?

Answer

The first page requested is "http://www.example.com/id=1000" (starting_number).

Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.

self.page_number is a spider attribute, so when you're decrementing its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.

For each response, you're generating 500 requests.

You can fix that by testing self.page_number >= 0 before yielding a new request.
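
A minimal sketch of that guard, assuming a parse() like the one described above (a loop that decrements self.page_number and yields follow-up requests; the data-extraction code is omitted):

def parse(self, response):
    # ... extract data from the current page ...
    for i in range(0, 500):
        self.page_number -= 1
        # only request pages whose ID is still non-negative
        if self.page_number >= 0:
            yield Request(url=URL % self.page_number, callback=self.parse)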

Edit after OP question in comments:

No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.

See the start_requests documentation.

Something like this will work:

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        super(FinalSpider, self).__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # ... parse data from the webpage ...
        pass
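
For completeness, here is a hypothetical parse() body (the CSS selector and item fields are illustrative assumptions, not part of the original question or answer):

    def parse(self, response):
        # illustrative only: yield the page ID taken from the URL plus the page title
        yield {
            'id': response.url.rsplit('=', 1)[-1],
            'title': response.css('title::text').extract_first(),
        }

With items yielded like this, you can run the spider and collect the output with, for example, scrapy crawl final -o pages.json.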
