scrapy scrape on all pages that have this syntax
Problem description
I want to use scrapy on all pages that have this syntax:
mywebsite/?page=INTEGER
I tried:
start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]
but it seems that the link is still MyWebsite. So what should I do to make it understand that I want to add /?page=NumberOfPage? The pages look like:
mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654
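For what it's worth, the `allow` pattern from the rule above does match URLs of this shape; here is a quick standalone check of that regex with plain `re` (the sample URLs are illustrative, not from any real site):

```python
import re

# The same pattern passed to SgmlLinkExtractor(allow=[...]) in the question.
pattern = re.compile(r'/\?page=\d+')

urls = ['mywebsite/?page=1', 'mywebsite/?page=7677654', 'mywebsite/about']

# Only the paginated URLs should survive the filter.
matched = [u for u in urls if pattern.search(u)]
print(matched)  # ['mywebsite/?page=1', 'mywebsite/?page=7677654']
```

If the extractor still only yields the bare start URL, the problem is usually that the paginated links are not present in the fetched HTML, not that the pattern is wrong.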
My code:
start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)
I want to get all the data from pages that have been added in the last 7 days. I don't know how many pages have been added in the last 7 days, so I thought I could crawl a large number of pages, say 100000, then check the date difference: if it is less than 7 days I want to yield, and if not I want to stop crawling entirely.
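One pitfall in the code above: `extract()[0]` returns a string, which has no `.days` attribute. The string has to be parsed into a `datetime` first and subtracted from today. A minimal sketch of that step, assuming the page shows a date like `2014-03-01` (the `%Y-%m-%d` format and the sample dates are assumptions; adjust to whatever the site actually renders):

```python
from datetime import datetime

def days_since(date_text, today=None):
    """Parse a scraped date string and return how many days old it is.

    The '%Y-%m-%d' format is an assumption about the site's markup.
    """
    posted = datetime.strptime(date_text, '%Y-%m-%d')
    if today is None:
        today = datetime.utcnow()
    return (today - posted).days

# A page posted on 2014-03-01, checked on 2014-03-05, is 4 days old.
age = days_since('2014-03-01', today=datetime(2014, 3, 5))
print(age)  # 4
```

With a helper like this, the spider's condition becomes `if days_since(DateDifference) < 8:` instead of calling `.days` on a raw string.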
Recommended answer
If I understand correctly, you want to crawl all pages that are younger than 7 days. One way to do it is to follow each page in order (assuming page no. 1 is the youngest, no. 2 is older than no. 1, no. 3 older than no. 2, and so on).
You could do something like:
start_urls = ['mywebsite/?page=1']

def parse(self, response):
    sel = Selector(response)
    DateDifference = sel.xpath('xpath for date difference').extract()[0]
    i = response.meta['index'] if 'index' in response.meta else 1
    if DateDifference.days < 8:
        yield Request(Link, meta={'date': Date}, callback=self.crawl)
        i += 1
        yield Request('mywebsite/?page=' + str(i), meta={'index': i}, callback=self.parse)
The idea is to execute parse sequentially. If this is the first time you enter the function, response.meta['index'] isn't defined: the index is 1. If this is a call after we already parsed another page, response.meta['index'] is defined: the index indicates the number of the page currently scraped.
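The control flow of that sequential pattern can be sketched outside scrapy as a plain loop: request page i, and only if it is still fresh, collect it and move on to page i + 1. The `get_age_days` callback and the sample ages below are hypothetical stand-ins for the scraped date difference:

```python
def crawl_pages(get_age_days, max_age=7):
    """Mimic the meta={'index': i} pattern: page i is requested only
    after page i - 1 was parsed and found to be at most max_age days old.

    get_age_days(page_number) stands in for the scraped date difference.
    """
    i = 1
    collected = []
    while get_age_days(i) <= max_age:
        collected.append(i)  # in the spider: yield Request(Link, ...)
        i += 1               # in the spider: yield Request('mywebsite/?page=%d' % i, meta={'index': i})
    return collected         # the first stale page stops the crawl

# Hypothetical site: pages 1-5 are a week old or less, page 6 onward is older.
ages = {1: 0, 2: 1, 3: 3, 4: 5, 5: 7, 6: 12}
pages = crawl_pages(lambda page: ages.get(page, 99))
print(pages)  # [1, 2, 3, 4, 5]
```

Unlike the 100000-URL `start_urls` approach in the question, this never fetches more pages than needed: the crawl ends at the first page older than the cutoff.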