Setting sticky cookie in scrapy


Question

The website I am scraping has JavaScript that sets a cookie and checks it on the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple enough, but then setting it seems to be a problem in scrapy. So my code is:

import re

from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )),
                  callback='parse_page'),)

    def init_request(self):
        # Fetch the page whose inline JavaScript sets the cookie.
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # Pull the cookie name and value out of the setCookie(...) call.
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        # Re-request a known page with the cookie attached to verify it works.
        return Request(url=self.test_page, callback=self.check_test_page,
                       cookies={cookie: value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            # The cookie was accepted; hand control back to the crawl.
            self.initialized()

    def parse_page(self, response):
        # scraping....
        pass

I can see that the content is available in check_test_page, so the cookie works perfectly. But it never even gets to parse_page, since without the right cookie CrawlSpider doesn't see any links. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?
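If it does come down to attaching the cookie to every request by hand, a small downloader middleware could at least do it in one place. The sketch below is purely illustrative: StickyCookieMiddleware and the sticky_cookie spider attribute are made-up names, and the priority 600 assumes the built-in cookies middleware runs at 700, as it did in Scrapy versions of this era.

# middlewares.py -- hypothetical sketch, not part of the question's code
class StickyCookieMiddleware(object):
    """Merge a cookie stored on the spider into every outgoing request."""

    def process_request(self, request, spider):
        cookie = getattr(spider, 'sticky_cookie', None)  # e.g. {'name': 'value'}
        if cookie:
            # request.cookies is the dict given as Request(cookies=...);
            # the built-in cookies middleware folds it into the cookiejar.
            request.cookies.update(cookie)
        return None  # let the request continue through the chain

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.StickyCookieMiddleware': 600,
}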

A less desirable alternative would be to set the cookie (the value seems to never change) through scrapy configuration files somehow. Is that possible?
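For what it's worth, a fixed cookie could be sketched in settings.py with DEFAULT_REQUEST_HEADERS. The cookie name and value below are placeholders, and since the cookies middleware manages the Cookie header itself, it may need to be disabled for a hard-coded header to survive:

# settings.py -- minimal sketch; 'js_check=1' is a placeholder value
DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'js_check=1',
}

# The cookies middleware rewrites the Cookie header it manages, so turn
# it off when sending a fixed header like this.
COOKIES_ENABLED = False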

Answer

It turned out that InitSpider is a BaseSpider. So it looks like 1) there's no way to use CrawlSpider in this situation and 2) there's no way to set a sticky cookie.
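That said, a workaround sometimes suggested (not part of the original answer) is to skip InitSpider, subclass CrawlSpider directly, and fetch the cookie-setting page in start_requests before the crawl begins. This relies on the assumption that cookies passed to a Request via cookies= are kept in the session cookiejar and resent on later requests, which is the documented behavior of Scrapy's cookies middleware. init_url is carried over from the question; the other names are hypothetical.

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class CookieFirstSpider(CrawlSpider):
    name = 'cookie_first'
    init_url = 'http://example.com/'            # page whose JS sets the cookie
    start_url = 'http://example.com/products/'  # real entry point of the crawl

    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )),
                  callback='parse_page'),)

    def start_requests(self):
        # Fetch the cookie-setting page before anything else.
        yield Request(self.init_url, callback=self.after_init)

    def after_init(self, response):
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body)
        if not match:
            raise ValueError('cookie not found in %s' % response.url)
        # cookies= seeds the session cookiejar; the cookies middleware should
        # then resend the cookie on every request the rules generate.
        yield Request(self.start_url,
                      cookies={match.group(1): match.group(2)})
        # No callback given: CrawlSpider's default parse() applies the rules.

    def parse_page(self, response):
        # scraping goes here
        pass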

