How can I make Scrapy process URLs sequentially
Problem description

I have this code:
# Note: `StackItem`, `self.test1`, `ivar`, `soup` and `nextlink` are defined
# elsewhere in the spider; the snippet was posted without them.
from urllib.parse import urljoin

from scrapy import log
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="headline_area"]')
    items = []
    for site in sites[:5]:
        item = StackItem()
        log.msg(' LOOP' + str(ivar), level=log.ERROR)
        item['title'] = "yoo ma"
        request = Request("blabla", callback=self.test1)
        request.meta['item'] = item
        # `nextlink` here still holds the value from the previous page
        page_number = nextlink.split("&")[-1].split("=")[-1]
        if int(page_number) > 500:
            raise CloseSpider('Search Exceeded 500')
        ivar = ivar + 1
        yield request

    # follow the "next page" link
    mylinks = soup.find_all('a')
    if mylinks:
        nextlink = mylinks[0].get('href')
        page_number = nextlink.split("&")[-3].split("=")[-1]
        request = Request(urljoin(response.url, nextlink), callback=self.parse)
        request.meta['page'] = page_number
        yield request
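The split-based page-number parsing in the snippet is fragile but easy to test in isolation. As a standalone illustration (the link shape below is a made-up example in which the page parameter sits third from the end), the `[-3]` variant does this:

```python
def extract_page_number(nextlink):
    # Mirrors the question's parsing: take the third-from-last "&"-separated
    # chunk of the URL and keep whatever follows its "=".
    return nextlink.split("&")[-3].split("=")[-1]

# Hypothetical link with the page parameter third from the end:
link = "viewforum.php?f=1&page=7&sid=abc&order=asc"
print(extract_page_number(link))  # → 7
```

For real URLs, `urllib.parse.parse_qs` on the query string would be a more robust way to read a single parameter than positional splitting.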
Now my problem is that, suppose I want to stop at page_number = 5.

At the moment Scrapy goes to that page before all the items from page 1, page 2, etc. have been downloaded, and stops as soon as it first reaches it.

How can I avoid going to page = 5 before the earlier pages are done?
Recommended answer
Does the link follow some regular pattern across pages? For example, if the 5th page's link is www.xxxx.net/nForum/#!article/Bet/447540?p=5, you can scrape the link with p=5 directly.
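A minimal sketch of that suggestion, assuming the page number is carried in a `p=` query parameter (the base URL below is a made-up stand-in): build the page links up front instead of following "next" links.

```python
def page_urls(base, last_page):
    """Build the p=1..last_page links directly, in order."""
    return [f"{base}?p={n}" for n in range(1, last_page + 1)]

# Hypothetical base URL standing in for the forum thread:
urls = page_urls("http://www.xxxx.net/nForum/article/Bet/447540", 5)
print(urls[-1])  # → http://www.xxxx.net/nForum/article/Bet/447540?p=5
```

In a spider these could be yielded from `start_requests()`; passing `priority=-n` to each `scrapy.Request` makes the scheduler prefer earlier pages, which also addresses the ordering problem in the question.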