The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

Problem description

Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list, about 3,000,000 URLs in a single file, so I build start_urls like this:

import codecs

def read_urls_from_file(file_path):
    # Lazily yield one URL per line so the whole file is never held in memory.
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                yield line.strip()
            except Exception:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"

start_urls = read_urls_from_file(u"XXXX")

Meanwhile, my spider's callback functions look like this:

def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []

My questions are:

  1. In what order does the downloader use the URLs? Will the requests made by just_test1 and just_test2 be used by the downloader only after all of the start_urls have been used? (I have run some tests, and the answer seems to be No.)
  2. What decides the order? Why is the order like this, and how can we control it?
  3. Is this a good way to deal with so many URLs that are already in a file? What else could I do?

Thank you very much!!!

Thanks for the answers, but I am still a bit confused about the statement that, by default, Scrapy uses a LIFO queue for storing pending requests:

  1. The requests made by the spider's callback functions are handed to the scheduler. Who does the same thing for the requests built from start_urls? The spider's start_requests() function only generates an iterator without producing the actual requests.
  2. Do all requests (those from start_urls and those from callbacks) end up in the same request queue? How many queues are there in Scrapy?

Solution

First of all, please see this thread - I think you'll find all the answers there.

In what order does the downloader use the URLs? Will the requests made by just_test1 and just_test2 be used by the downloader only after all of the start_urls have been used? (I have run some tests, and the answer seems to be No.)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
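
One reason the answer is No is that start_requests is a plain generator: the engine pulls URLs from it lazily while responses to earlier requests are already coming back, and the downloader keeps many requests in flight at once. As a rough sketch of the concurrency settings involved (the names come from the Scrapy documentation; the values shown are what I believe the defaults are, so verify them for your version):

# settings.py -- concurrency knobs that make downloads overlap
# (values believed to be Scrapy's defaults; check your installed version)
CONCURRENT_REQUESTS = 16             # total requests kept in flight by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per target domain
DOWNLOAD_DELAY = 0                   # no artificial pause between requests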

What decides the order? Why is the order like this, and how can we control it?

By default, there is no pre-defined order - you cannot know when Requests from make_requests_from_url will arrive - it's asynchronous.
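
As a side note on what actually decides the order (and on the follow-up question about the LIFO queue): the scheduler stores pending requests in LIFO queues by default, which gives depth-first-like behaviour, and it can be switched to FIFO queues for breadth-first-like crawling. A minimal sketch, assuming the setting names documented in the Scrapy FAQ (check the exact module paths against your Scrapy version):

# settings.py -- switch the scheduler from the default LIFO (DFO-like) queues
# to FIFO (BFO-like) queues; names taken from the Scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'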

See this answer on how you may control the order. Long story short, you can override start_requests and mark the yielded Requests with a priority key (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the URL was found.
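
Besides the meta-based approach from that answer, scrapy.Request also accepts a priority argument that the scheduler uses when choosing what to dequeue next (higher values are processed earlier, as far as I know). A minimal sketch under that assumption, keeping the original order by negating the index; the spider is hypothetical and the URLs are just the ones from the question:

from scrapy import Request, Spider


class PriorityExampleSpider(Spider):
    # hypothetical spider used only to illustrate the priority argument
    name = "priority_example"

    def start_requests(self):
        urls = [u"http://www.baidu.com", u"http://www.163.com"]
        for index, url in enumerate(urls):
            # higher priority is dequeued first, so negate the index to keep
            # the original list order
            yield Request(url, priority=-index, callback=self.parse)

    def parse(self, response):
        self.log("Visited %s" % response.url)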

Is this a good way to deal with so many URLs that are already in a file? What else could I do?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So, you should do something like this:

# module-level imports: codecs for the GB18030-encoded file, Request for scheduling
import codecs
from scrapy.http import Request

def start_requests(self):
    # self.file_path points to the file holding one URL per line
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            if not url:
                continue  # skip blank lines
            yield Request(url, meta={'priority': index})
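
Because the file is read lazily inside start_requests, the 3,000,000 URLs are never loaded into memory all at once. For a crawl of that size it may also be worth enabling Scrapy's job persistence so the run can be paused and resumed; a hedged sketch, where the directory name is only a placeholder:

# settings.py -- optional: persist scheduler state so a very long crawl can be
# paused and resumed (JOBDIR is a standard Scrapy setting; the path is an example)
JOBDIR = 'crawls/url_file_run-1'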

Hope that helps.
