scrapy scrape on all pages that have this syntax
Problem description
I want to use scrapy on all pages that have this syntax:
mywebsite/?page=INTEGER
I tried:
start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]
but it seems that the link is still MyWebsite. So what should I do to make it understand that I want to add /?page=NumberOfPage? The pages look like:
mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654
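For what it's worth, the `allow` pattern from the rule above does match URLs of this shape; here is a quick standalone check of that regex with plain `re` (the sample URLs are illustrative, not from any real site):

```python
import re

# The same pattern passed to SgmlLinkExtractor(allow=[...]) in the question.
pattern = re.compile(r'/\?page=\d+')

urls = ['mywebsite/?page=1', 'mywebsite/?page=7677654', 'mywebsite/about']

# Only the paginated URLs should survive the filter.
matched = [u for u in urls if pattern.search(u)]
print(matched)  # ['mywebsite/?page=1', 'mywebsite/?page=7677654']
```

If the extractor still only yields the bare start URL, the problem is usually that the paginated links are not present in the fetched HTML, not that the pattern is wrong.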
My code:
start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)
I want to get all the data from pages that have been added in the last 7 days. I don't know how many pages have been added in the last 7 days, so I thought I could crawl a large number of pages, say 100000, then check the date difference: if it is less than 7 days I want to yield, and if not I want to stop crawling entirely.
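One pitfall in the code above: `extract()[0]` returns a string, which has no `.days` attribute. The string has to be parsed into a `datetime` first and subtracted from today. A minimal sketch of that step, assuming the page shows a date like `2014-03-01` (the `%Y-%m-%d` format and the sample dates are assumptions; adjust to whatever the site actually renders):

```python
from datetime import datetime

def days_since(date_text, today=None):
    """Parse a scraped date string and return how many days old it is.

    The '%Y-%m-%d' format is an assumption about the site's markup.
    """
    posted = datetime.strptime(date_text, '%Y-%m-%d')
    if today is None:
        today = datetime.utcnow()
    return (today - posted).days

# A page posted on 2014-03-01, checked on 2014-03-05, is 4 days old.
age = days_since('2014-03-01', today=datetime(2014, 3, 5))
print(age)  # 4
```

With a helper like this, the spider's condition becomes `if days_since(DateDifference) < 8:` instead of calling `.days` on a raw string.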
Recommended answer
If I understand correctly, you want to crawl all pages that are younger than 7 days. One way to do it is to follow each page in order (assuming page no. 1 is the youngest, no. 2 is older than no. 1, no. 3 older than no. 2, and so on).
You could do something like:
start_urls = ['mywebsite/?page=1']

def parse(self, response):
    sel = Selector(response)
    DateDifference = sel.xpath('xpath for date difference').extract()[0]
    i = response.meta['index'] if 'index' in response.meta else 1
    if DateDifference.days < 8:
        yield Request(Link, meta={'date': Date}, callback=self.crawl)
        i += 1
        yield Request('mywebsite/?page=' + str(i), meta={'index': i}, callback=self.parse)
The idea is to execute parse sequentially. If this is the first time you enter the function, response.meta['index'] isn't defined: the index is 1. If this is a call after we already parsed another page, response.meta['index'] is defined: the index indicates the number of the page currently scraped.
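The control flow of that sequential pattern can be sketched outside scrapy as a plain loop: request page i, and only if it is still fresh, collect it and move on to page i + 1. The `get_age_days` callback and the sample ages below are hypothetical stand-ins for the scraped date difference:

```python
def crawl_pages(get_age_days, max_age=7):
    """Mimic the meta={'index': i} pattern: page i is requested only
    after page i - 1 was parsed and found to be at most max_age days old.

    get_age_days(page_number) stands in for the scraped date difference.
    """
    i = 1
    collected = []
    while get_age_days(i) <= max_age:
        collected.append(i)  # in the spider: yield Request(Link, ...)
        i += 1               # in the spider: yield Request('mywebsite/?page=%d' % i, meta={'index': i})
    return collected         # the first stale page stops the crawl

# Hypothetical site: pages 1-5 are a week old or less, page 6 onward is older.
ages = {1: 0, 2: 1, 3: 3, 4: 5, 5: 7, 6: 12}
pages = crawl_pages(lambda page: ages.get(page, 99))
print(pages)  # [1, 2, 3, 4, 5]
```

Unlike the 100000-URL `start_urls` approach in the question, this never fetches more pages than needed: the crawl ends at the first page older than the cutoff.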