Example code for Scrapy process_links and process_request

Problem description

I am new to Scrapy, and I was hoping someone could give me good example code showing when process_links and process_request are most useful. I see that process_links is used to filter URLs, but I don't know how to write it.

Thanks.

Recommended answer

You mean scrapy.spiders.Rule, which is most commonly used in scrapy.CrawlSpider.

They do pretty much what their names say; in other words, they act as a sort of middleware between the time a link is extracted and the time it is processed/downloaded.

process_links sits between the point where a link is extracted and the point where it is turned into a request. There are some pretty cool use cases for this; to name a few common ones:

  1. Filtering out links you don't want.
  2. Manually redirecting to avoid bad requests.

Example:

def process_links(self, links):
    for link in links:
        # 1: filter out links you don't want
        if 'foo' in link.text:
            continue  # skip all links that have "foo" in their text
        # 2: fix the url to avoid an unnecessary redirection
        link.url = link.url + '/'
        yield link

process_request sits between the moment a request has just been built and the moment it is downloaded. It shares some use cases with process_links, but it can also do some other cool stuff, like:

  1. Modifying headers (cookies, for example).
  2. Changing details such as the callback based on certain keywords in the URL.

Example:

def process_req(self, req):
    # 1: modify headers (set a cookie, for example)
    req = req.replace(headers={'Cookie': 'foobar'})
    # 2: change the callback based on a keyword in the url
    if 'foo' in req.url:
        return req.replace(callback=self.parse_foo)
    elif 'bar' in req.url:
        return req.replace(callback=self.parse_bar)
    return req
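
For context, here is a minimal sketch of how both callbacks might be wired into a CrawlSpider through a Rule. The spider name, start URL, and parse_item callback are hypothetical placeholders; the two hooks are referenced by the names of the spider methods:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'my_spider'                    # hypothetical spider name
    start_urls = ['http://example.com/']  # hypothetical start URL

    rules = (
        Rule(
            LinkExtractor(),                # extract links from every crawled page
            callback='parse_item',
            process_links='process_links',  # receives the list of extracted links
            process_request='process_req',  # receives each request built from a link
            follow=True,
        ),
    )

    def parse_item(self, response):
        pass  # hypothetical item parsing

    def process_links(self, links):
        # drop links whose text contains "foo"
        return [link for link in links if 'foo' not in link.text]

    def process_req(self, req, response=None):
        # the response argument is only passed by newer Scrapy versions;
        # the default keeps this sketch compatible with older ones
        return req.replace(headers={'Cookie': 'foobar'})

Both hooks can also be passed to the Rule as callables instead of method names.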

You probably won't use them often, but these two can be really convenient and easy shortcuts on some occasions.
