顺序抓取抓取网址 [英] Scrapy Crawl URLs in Order

查看:68
本文介绍了顺序抓取抓取网址的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以,我的问题相对简单.我有一个蜘蛛爬行多个站点,并且我需要它按照我在代码中写入的顺序返回数据.在下面发布.

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

结果以随机顺序返回,例如,它返回第29位,然后返回28位,然后返回30位.我曾尝试将调度程序的顺序从DFO更改为BFO,以防万一这是问题所在,但这并没有任何改变.

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

推荐答案

start_urls定义

start_urls defines urls which are used in start_requests method. Your parse method is called with a response for each start urls when the page is downloaded. But you cannot control loading times - the first start url might come the last to parse.

一种解决方案-覆盖start_requests方法,并使用priority键将meta添加到生成的请求中.在parse中,提取此priority值并将其添加到item中.在管道中,基于此值执行某些操作. (我不知道为什么需要这些URL以及在何处按此顺序进行处理.)

A solution -- override start_requests method and add to generated requests a meta with priority key. In parse extract this priority value and add it to the item. In the pipeline do something based in this value. (I don't know why and where you need these urls to be processed in this order).

或者使其具有同步性-将这些起始网址存储在某个地方.首先输入start_urls.在parse中处理第一个响应并产生商品,然后从存储中获取下一个URL,并通过parse的回调请求它.

Or make it kind of synchronous -- store these start urls somewhere. Put in start_urls the first of them. In parse process the first response and yield the item(s), then take next url from your storage and make a request for it with callback for parse.

这篇关于顺序抓取抓取网址的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆