Scrapy - how to manage cookies/sessions


Question

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.

This is basically a simplified version of what I'm trying to do:


The way the website works:

When you visit the website you get a session cookie.

When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.


My script:

My spider has a start url of searchpage_url

The searchpage is requested by parse() and the search form response gets passed to search_generator()

search_generator() then yields lots of search requests using FormRequest and the search form response.

Each of those FormRequests, and its subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
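
To make that setup concrete, here is a minimal sketch of the flow described above (the URL, search terms, and form field name are made up; the 'cookiejar' meta key used here to separate the sessions is the mechanism covered in the solution below):

import scrapy
from scrapy.http import FormRequest

class SearchSpider(scrapy.Spider):
    name = "search"
    # Hypothetical search page URL; the real one comes from the target site.
    start_urls = ["http://www.example.com/search"]

    def parse(self, response):
        # The search page response carries the form; hand it to the generator.
        return self.search_generator(response)

    def search_generator(self, response):
        # Hypothetical search terms; the form field name "q" is made up.
        for i, term in enumerate(["foo", "bar", "baz"]):
            yield FormRequest.from_response(
                response,
                formdata={"q": term},
                # Give each search its own cookie jar so sessions don't mix.
                meta={"cookiejar": i},
                callback=self.parse_results,
            )

    def parse_results(self, response):
        # Child requests must re-attach the same jar to stay in the session.
        pass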


I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
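
For reference, the meta option in question is dont_merge_cookies. Setting it to True tells the cookies middleware not to attach stored cookies to that request and not to merge the response's cookies back into the jar; it applies per request, it does not create a per-spider jar. A minimal sketch:

import scrapy

# Sent without any stored session cookies; cookies the response sets
# are not merged back into the spider's cookie jar.
request = scrapy.Request(
    "http://www.example.com/search",
    meta={"dont_merge_cookies": True},
)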

If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?

I assume I would have to disable concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
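
For completeness, concurrency can be capped through settings, e.g. per spider via custom_settings, though the per-request cookie jars shown in the solution below make this unnecessary:

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"
    # Issue one request at a time for this spider only.
    custom_settings = {"CONCURRENT_REQUESTS": 1}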

I'm confused; any clarification would be gratefully received!


EDIT:

Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the next.

I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request.
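
That manual route is workable: turn the cookies middleware off with COOKIES_ENABLED = False in settings, pull the Set-Cookie header out of the response, and re-send it by hand. A rough sketch (the URLs are hypothetical and the header parsing is deliberately naive):

import scrapy

# settings.py: COOKIES_ENABLED = False  (turns off the cookies middleware)

class ManualCookieSpider(scrapy.Spider):
    name = "manual_cookies"
    start_urls = ["http://www.example.com/search"]

    def parse(self, response):
        # Grab the raw session cookie from the response headers.
        # Naive parsing: keep only the "name=value" part of the first cookie.
        set_cookie = response.headers.getlist("Set-Cookie")[0]
        session_cookie = set_cookie.decode("utf-8").split(";")[0]
        yield scrapy.Request(
            "http://www.example.com/results?page=2",
            # Re-send the cookie by hand on every subsequent request.
            headers={"Cookie": session_cookie},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        pass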

Is this what you should do in this situation?

Solution

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    # Use the loop index as the cookiejar key so each URL gets its own session.
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
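
Note that the 'cookiejar' meta key relies on the default cookies middleware being enabled: each distinct key gets its own in-memory cookie jar, so a single spider can run many isolated sessions concurrently, which also answers the concurrency worry above.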
