Scrapy Crawl URLs in Order
Problem description
My problem is relatively simple. I have one spider crawling multiple URLs, and I need it to return the data in the order in which I list them in my code, which is posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results come back in an unpredictable order: for example, the page for the 29th is returned first, then the 28th, then the 30th. I tried changing the scheduler order from DFO to BFO in case that was the problem, but it made no difference.
Recommended answer
start_urls defines the URLs that the start_requests method uses to build the initial requests. Your parse method is called with a response for each start URL once that page has been downloaded. But you cannot control loading times: the first start URL might be the last one to reach parse.
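To make the point about loading times concrete, here is a small simulation. It does not use Scrapy at all: asyncio stands in for the downloader, and the delays are made-up values chosen so that the second URL finishes first. It is only a sketch of why completion order can differ from the order in start_urls.

```python
import asyncio

# Hypothetical download times in seconds; the real values depend on the server.
FAKE_DELAYS = {
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/": 0.2,
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/": 0.1,
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/": 0.3,
}


async def fake_download(url):
    """Pretend to download a page by sleeping for its assumed delay."""
    await asyncio.sleep(FAKE_DELAYS[url])
    return url


async def crawl(urls):
    """Start all downloads at once and record the order in which they finish."""
    tasks = [asyncio.create_task(fake_download(url)) for url in urls]
    arrival_order = []
    for finished in asyncio.as_completed(tasks):
        arrival_order.append(await finished)
    return arrival_order


# Requests are issued in the listed order (28th, 29th, 30th) ...
arrival_order = asyncio.run(crawl(list(FAKE_DELAYS)))
# ... but responses arrive fastest-first: 29th, 28th, 30th with these delays.
for url in arrival_order:
    print(url)
```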
One solution: override the start_requests method and attach a meta dict with a priority key to each generated request. In parse, read this priority value from the response's meta and add it to the item. In the pipeline, use this value to put the items in the order you need. (I don't know why or where you need these URLs to be processed in this order.)
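As a rough sketch of the pipeline side of this idea, the class below buffers items and sorts them by priority when the spider closes. It assumes the spider already attached meta={'priority': index} to each start request and copied response.meta['priority'] into the item, and that the item declares a priority field. The class name and the ordered_items attribute are made up for illustration.

```python
class OrderByPriorityPipeline:
    """Buffer every item, then sort by the 'priority' field at shutdown.

    Assumption: each item carries a 'priority' value equal to the position
    of its URL in the original list (0 for the first URL, 1 for the next).
    """

    def open_spider(self, spider):
        # Items arrive in download-completion order, so hold them for now.
        self.buffered = []

    def process_item(self, item, spider):
        self.buffered.append(item)
        return item

    def close_spider(self, spider):
        # All responses are in; restore the order the URLs were written in.
        self.ordered_items = sorted(
            self.buffered, key=lambda item: item["priority"]
        )
        # A real pipeline would write self.ordered_items to a file or
        # database here; that part is left out of this sketch.
```

The trade-off is that nothing is written until the crawl finishes, which is fine for three pages but means holding every item in memory for a large crawl.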
Alternatively, make the crawl effectively sequential: store the start URLs somewhere and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and issue a request for it with parse as the callback.