The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider
Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list: about 3,000,000 URLs in a file. So I build start_urls like this:
import codecs

start_urls = read_urls_from_file(u"XXXX")

def read_urls_from_file(file_path):
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"
Meanwhile, my spider's callback functions are like this:
def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []
My questions are:

- The order of the URLs used by the downloader: will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)
- What decides the order? Why and how is this order determined? How can we control it?
- Is this a good way to deal with so many URLs that are already in a file? What else could I do?
Thank you very much!!!
Thanks for the answers, but I am still a bit confused by "By default, Scrapy uses a LIFO queue for storing pending requests.":

- The requests made by a spider's callback functions are given to the scheduler. Who does the same thing for the start_urls' requests? The spider's start_requests() function only generates an iterator without giving the real requests.
- Will all the requests (the start_urls' and the callbacks') be in the same request queue? How many queues are there in Scrapy?
First of all, please see this thread - I think you'll find all the answers there.
The order of the URLs used by the downloader: will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)
You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
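As a side note that is not part of the original answer: in newer Scrapy releases make_requests_from_url is deprecated, and the equivalent default behaviour is simply to build the Requests inside start_requests, roughly like this:

def start_requests(self):
    # Rough equivalent of the old default, assuming a recent Scrapy release
    # where make_requests_from_url is deprecated.
    for url in self.start_urls:
        yield Request(url, dont_filter=True)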
What decides the order? Why and how is this order determined? How can we control it?
By default, there is no pre-defined order - you cannot know when the Requests from make_requests_from_url will arrive - it's asynchronous.
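For completeness, and not something the linked thread spells out: the crawl order can also be biased toward FIFO (breadth-first) by swapping the scheduler queues in settings.py. This is a sketch using the setting names documented for recent Scrapy versions; the squeues module path may differ in older releases:

# settings.py - minimal sketch for breadth-first-ish ordering (assumes a
# recent Scrapy release; older releases keep these classes under another path).
DEPTH_PRIORITY = 1                                           # prefer shallower requests
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"  # FIFO on-disk queue
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"    # FIFO in-memory queue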
See this answer on how you may control the order. Long story short, you can override start_requests and mark the yielded Requests with a priority key (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the URL was found.
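A side note that is not part of the answer above: Scrapy's Request also accepts a priority argument directly, and the default scheduler dequeues higher-priority requests first. Below is a minimal sketch under that assumption; the spider name and file path are invented for illustration, and the imports assume Scrapy 1.x or later:

import codecs
from scrapy import Request, Spider

class FileUrlSpider(Spider):
    name = "file_urls"      # hypothetical name, for illustration only
    file_path = u"XXXX"     # placeholder path, as in the question

    def start_requests(self):
        with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
            for index, line in enumerate(f):
                url = line.strip()
                if not url:
                    continue
                # Earlier lines get a larger priority value, so the default
                # scheduler tends to download them first.
                yield Request(url, priority=-index, callback=self.parse)

    def parse(self, response):
        self.log("Visited %s" % response.url)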
Is this a good way to deal with so many URLs that are already in a file? What else could I do?
I think you should read your file and yield the URLs directly in the start_requests method: see this answer. So, you should do something like this:
def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            try:
                url = line.strip()
                yield Request(url, meta={'priority': index})
            except:
                continue
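One extra practical point for a 3,000,000-line file, beyond the answer above: even when start_requests is a generator, every pending request still ends up in the scheduler, so it can help to let Scrapy keep that queue on disk via the documented JOBDIR setting, which also makes the crawl pausable and resumable. A sketch, with the directory name invented:

# settings.py - sketch only; JOBDIR tells Scrapy to persist the scheduler
# queue and dupefilter state in this directory instead of holding it in memory.
JOBDIR = "crawls/big_url_list-1"

The same thing can be passed on the command line with -s JOBDIR=crawls/big_url_list-1.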
Hope that helps.