Scraping many pages using scrapy
Problem description
I am trying to scrape multiple webpages using scrapy. The links of the pages look like:
http://www.example.com/id=some-number
On each following page, the number at the end is reduced by 1.
So I am trying to build a spider that navigates to the other pages and scrapes them too. The code that I have is given below:
import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range(starting_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
This runs into an infinite loop, and when printing the page number I am getting negative numbers. I think that is happening because I am requesting a page from within my parse() function.
But then the example given here works okay. Where am I going wrong?
Recommended answer
The first page requested is "http://www.example.com/id=1000" (starting_number). Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.
self.page_number is a spider attribute, so when you're decrementing its value, you have self.page_number == 500 after the first parse().
So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.
You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.
For each response, you're generating 500 requests.
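To make the runaway concrete, here is a minimal sketch of the pattern being described. The question elided the parse() body, so the loop below is an assumption reconstructed from this explanation, not the OP's actual code:

import scrapy
from scrapy.http import Request

URL = "http://www.example.com/id=%d"

class RunawaySpider(scrapy.Spider):
    name = "runaway"
    allowed_domains = ['example.com']
    start_urls = [URL % 1000]
    page_number = 1000  # spider attribute, shared by every callback

    def parse(self, response):
        # Each response schedules 500 more requests while decrementing the
        # shared counter, so it drops by 500 per round: 1000 -> 500 -> 0 -> -500 ...
        for i in range(0, 500):
            self.page_number -= 1
            yield Request(url=URL % self.page_number, callback=self.parse)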
You can break the loop by testing self.page_number >= 0.
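Applied to the sketch above, that test might look like this (again an assumption about where the check goes, since the original parse() body was not shown):

    def parse(self, response):
        # ... extract data from the page ...
        for i in range(0, 500):
            self.page_number -= 1
            # only request pages with non-negative IDs
            if self.page_number >= 0:
                yield Request(url=URL % self.page_number, callback=self.parse)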
Edit after OP question in comments:
No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.
Something like this will work:
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
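As a usage note, such a spider is normally started with scrapy crawl final inside a Scrapy project; a standalone script is also possible via CrawlerProcess. A sketch, with an assumed download delay that is not part of the answer:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "DOWNLOAD_DELAY": 1,  # assumed setting to stay polite; not from the answer
})
process.crawl(FinalSpider)
process.start()  # blocks until the crawl finishes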