scrapy scrape on all pages that have this syntax


Question

I want to scrape all pages that have this syntax:

mywebsite/?page=INTEGER

I tried:

start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]

but it seems that the links are still just MyWebsite. So what should I do to make it understand that I want to add /?page=NumberOfPage?

mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654
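As a sanity check, the allow regex itself does match URLs of this form; the catch is that a link extractor can only follow links that actually appear in the fetched page's HTML. A quick stdlib check of the pattern against hypothetical URLs:

```python
import re

# The same pattern passed to SgmlLinkExtractor(allow=[...]) above
pattern = re.compile(r'/\?page=\d+')

# Hypothetical URLs of the form the question describes
urls = [
    'http://mywebsite/?page=1',
    'http://mywebsite/?page=7677654',
    'http://mywebsite/about',
]

matches = [bool(pattern.search(u)) for u in urls]
print(matches)  # [True, True, False]
```

So the rule is not the problem: if the listing pages are not linked from the start page, the extractor has nothing to follow, no matter what the pattern allows.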

My code

start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)

I want to get all the data from pages that were added in the last 7 days. I don't know how many pages were added in that window, so I thought I could crawl a large number of pages, say 100000, and check the date difference on each: if it is less than 7 days I want to yield, otherwise I want to stop crawling entirely.
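Note that `extract()[0]` returns a string, which has no `.days` attribute; the date has to be parsed first, and `.days` comes from the `timedelta` produced by subtracting two datetimes. A minimal sketch with the stdlib (the date format and values are hypothetical):

```python
from datetime import datetime

# Hypothetical: suppose the page exposes its posting date as a string
posted = '2014-03-10'
posted_date = datetime.strptime(posted, '%Y-%m-%d')

# .days exists on the timedelta returned by subtraction,
# not on the raw string returned by extract()[0]
now = datetime(2014, 3, 15)  # fixed "today" for a reproducible example
difference = now - posted_date
print(difference.days)  # 5 -> younger than 8 days, keep this page
```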

Answer

If I get it right, you want to crawl all pages that are younger than 7 days. One way to do it is to follow each page in order (assuming page 1 is the youngest, page 2 is older than page 1, page 3 older than page 2, and so on).

You could do something like this:

start_urls = ['mywebsite/?page=1']

def parse(self, response):
    sel = Selector(response)
    # 'xpath for date difference', Link and Date are placeholders from the
    # question; note that extract()[0] returns a string, so the date must be
    # parsed into a datetime before a day difference can be compared
    DateDifference = sel.xpath('xpath for date difference').extract()[0]

    # First call: 'index' is not in meta yet, so start at page 1
    i = response.meta['index'] if 'index' in response.meta else 1

    if DateDifference.days < 8:
        yield Request(Link, meta={'date': Date}, callback=self.crawl)
        i += 1
        yield Request('mywebsite/?page=' + str(i), meta={'index': i}, callback=self.parse)

The idea is to execute parse sequentially. The first time you enter the function, response.meta['index'] is not defined, so the index defaults to 1. On any later call, after another page has already been parsed, response.meta['index'] is defined and indicates the number of the page currently being scraped. Because no new page request is yielded once the date check fails, the crawl stops by itself at the first page older than 7 days.
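The stop-at-first-old-page logic can be sketched without Scrapy at all, using a stub in place of the real request and XPath extraction (all names here are hypothetical, not part of the original spider):

```python
# Stdlib-only simulation of the sequential "crawl until too old" idea.
def fetch_age_days(page):
    # Stub for the real request + date extraction: pretend each page
    # is one day older than the previous one (page 1 is 0 days old)
    return page - 1

def crawl(max_pages=100000):
    scraped = []
    page = 1
    while page <= max_pages:
        if fetch_age_days(page) >= 8:  # older than 7 days: stop entirely
            break
        scraped.append(page)           # stands in for yielding the item
        page += 1                      # then request the next page
    return scraped

print(crawl())  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Pages 1 through 8 have ages 0 through 7 days and are kept; page 9 is 8 days old, so the loop stops and no further pages are fetched, which is exactly what chaining Requests through meta achieves in the spider above.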

