Scrapy - how to manage cookies/sessions


Question

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.

This is basically a simplified version of what I'm trying to do:

When you visit the website you get a session cookie.

When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.

My spider has a start URL of searchpage_url.

The search page is requested by parse(), and the search form response gets passed to search_generator().

search_generator() then yields lots of search requests using FormRequest and the search form response.

Each of those FormRequests, and its subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.

I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?

If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
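With the cookiejar meta key (see the answer below), cookies are not per-spider but per session key: Scrapy's CookiesMiddleware keeps a separate cookie jar for each distinct value of meta['cookiejar']. A minimal standard-library sketch of that idea (not Scrapy's actual code):

```python
# Sketch of keeping one cookie jar per session key, the way
# Scrapy's CookiesMiddleware isolates requests that carry
# different 'cookiejar' meta values. Illustration only.
from collections import defaultdict
from http.cookiejar import CookieJar

class SessionJars:
    """Maps an arbitrary session key to its own CookieJar."""
    def __init__(self):
        self.jars = defaultdict(CookieJar)

    def jar_for(self, key):
        # Requests carrying the same key share cookies;
        # different keys are fully isolated sessions.
        return self.jars[key]

sessions = SessionJars()
assert sessions.jar_for(0) is sessions.jar_for(0)      # same session
assert sessions.jar_for(0) is not sessions.jar_for(1)  # isolated sessions
```

Because isolation is keyed per request chain rather than per spider, concurrent requests with different keys do not interfere with each other.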

I assume I would have to disable multiple concurrent requests; otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?

I'm confused; any clarification would be greatly appreciated!

Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.

I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request.
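For reference, a hedged sketch of what that manual approach would involve, using only the standard library; the header value here is invented for illustration:

```python
# Manual session handling: parse the Set-Cookie header from a
# response and rebuild a Cookie header for the next request.
# The session id below is a made-up example value.
from http.cookies import SimpleCookie

set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"
cookie = SimpleCookie()
cookie.load(set_cookie_header)

# Build the Cookie header to attach to the follow-up request;
# attributes like Path and HttpOnly are not sent back.
cookie_header = "; ".join(f"{k}={v.value}" for k, v in cookie.items())
assert cookie_header == "sessionid=abc123"
```

This works, but it reimplements what the cookiejar meta key (described in the answer below) already does, so it is rarely the better choice.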

Is this what you should do in this situation?

Answer

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    # Each distinct value of i keys a separate cookiejar, giving
    # every request chain its own isolated session.
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)

