Example code for Scrapy process_links and process_request
Question
I am new to Scrapy and I was hoping someone could give me good example code showing when process_links and process_request are most useful. I see that process_links is used to filter URLs, but I don't know how to code it.

Thank you.
Answer
You mean scrapy.spiders.Rule, which is most commonly used in scrapy.CrawlSpider.
They do pretty much what their names say; in other words, they act as a sort of middleware between the time a link is extracted and the time it is processed/downloaded.
process_links sits between when a link is extracted and when it is turned into a Request. There are pretty cool use cases for this; to name a few common ones:
- Filter out links that you don't want.
- Redirect manually to avoid erroneous requests.
Example:
def process_links(self, links):
    for link in links:
        # 1. skip all links that have "foo" in their text
        if 'foo' in link.text:
            continue
        # 2. fix the url to avoid an unnecessary redirection
        link.url = link.url + '/'
        yield link
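Outside a running crawl, the same logic can be sanity-checked with a small stand-in for Scrapy's Link objects (the Link class below is a placeholder for illustration, not scrapy.link.Link):

```python
from dataclasses import dataclass

# Minimal stand-in for scrapy.link.Link, just for illustration.
@dataclass
class Link:
    url: str
    text: str = ''

def process_links(links):
    for link in links:
        # 1. skip all links that have "foo" in their text
        if 'foo' in link.text:
            continue
        # 2. fix the url to avoid an unnecessary redirection
        link.url = link.url + '/'
        yield link

links = [Link('http://example.com/a', 'foo link'),
         Link('http://example.com/b', 'keep me')]
kept = list(process_links(links))
# only the second link survives, with a trailing slash appended
```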
process_request sits between when the request is created and before it is downloaded. It shares some use cases with process_links, but can also do some other cool stuff, such as:
- Modify headers (e.g. cookies).
- Change details such as the callback, depending on keywords in the URL.
Example:
def process_req(self, req):
    # 1. modify headers (e.g. add a cookie)
    req = req.replace(headers={'Cookie': 'foobar'})
    # 2. change the callback depending on keywords in the url
    if 'foo' in req.url:
        return req.replace(callback=self.parse_foo)
    elif 'bar' in req.url:
        return req.replace(callback=self.parse_bar)
    return req
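The callback-routing idea can likewise be exercised without a crawl, using a tiny placeholder class that mimics scrapy.Request.replace (returning a copy with some attributes swapped); the Request class and parse_foo/parse_bar callbacks below are illustrative stand-ins, not Scrapy's own:

```python
# Minimal stand-in for scrapy.Request, just to exercise the routing logic.
class Request:
    def __init__(self, url, callback=None, headers=None):
        self.url = url
        self.callback = callback
        self.headers = headers or {}

    def replace(self, **kwargs):
        # like scrapy.Request.replace: copy with some attributes swapped
        attrs = {'url': self.url, 'callback': self.callback,
                 'headers': dict(self.headers)}
        attrs.update(kwargs)
        return Request(**attrs)

def parse_foo(response): ...
def parse_bar(response): ...

def process_req(req):
    # 1. add a cookie header to every request
    req = req.replace(headers={'Cookie': 'foobar'})
    # 2. route to a different callback depending on the url
    if 'foo' in req.url:
        return req.replace(callback=parse_foo)
    elif 'bar' in req.url:
        return req.replace(callback=parse_bar)
    return req

routed = process_req(Request('http://example.com/foo'))
# routed now carries the cookie header and the parse_foo callback
```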
You probably won't use them often, but these two can be really convenient and easy shortcuts on some occasions.