Scrapy InIt self.initialized() -- not initializing


Problem Description

I am trying to use Scrapy to log in to a website in the init, and then, after confirming the login, initialize and start the standard crawl through start_urls. I'm not sure what is going wrong; I get through the login and everything confirms, but parse_item never starts. Any help would be well appreciated.

I can get as far as "================Successfully logged in=================",

but I cannot get to "==========================PARSE ITEM==========================".

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from selenium import webdriver

class ProductDetailsSpider(InitSpider):
    name = 'product_details_spider'
    allowed_domains = ['my_domain.com']
    login_page = 'http://www.my_domain.com/'
    start_urls = ['http://www.my_domain.com/nextpage1/',
                  'http://www.my_domain.com/nextpage2/',
                  'http://www.my_domain.com/nextpage3/']

    rules = (
        Rule(SgmlLinkExtractor(allow=()),
            callback='parse_item',
            follow=True),
        )

    def get_cookies(self):
        driver = webdriver.Firefox()
        driver.implicitly_wait(30)
        base_url = "http://www.my_domain.com"
        driver.get(base_url + "/")
        driver.find_element_by_name("USR").clear()
        driver.find_element_by_name("USR").send_keys("my_user")
        driver.find_element_by_name("PASSWRD").clear()
        driver.find_element_by_name("PASSWRD").send_keys("my_pass")
        driver.find_element_by_name("submit").click()
        cookies = driver.get_cookies()
        driver.close()
        cookie_dic = {}
        for c in cookies:
            cookie_dic[c['name']] = c['value']
        return cookie_dic

    def init_request(self):
        print '=======================INIT======================='
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        print '=======================LOGIN======================='
        """Generate a login request."""
        return [FormRequest.from_response(response,formname='login_form',
            formdata={'USR': 'my_user', 'PASSWRD': 'my_pass'},
            callback=self.login_cookies)]

    def login_cookies(self, response):
        print '=======================COOKIES======================='
        return Request(url='http://www.my_domain.com/home',
            cookies=self.get_cookies(),
            callback=self.check_login_response)

    def check_login_response(self, response):
        print '=======================CHECK LOGIN======================='
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Logoff" in response.body:
            print "=========Successfully logged in.========="
            self.initialized()
            # Now the crawling can begin..
        else:
            print "==============Bad times :(==============="
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        print "==============PARSE ITEM=========================="
        # Scrape data from page

Recommended Answer

I'm a bit late to the party, but I'm quite sure that you need to return self.initialized():

if "Logoff" in response.body:
    print "=========Successfully logged in.========="
    return self.initialized()
    # Now the crawling can begin..
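
In Scrapy, a callback only drives the crawl through the requests (or items) it returns or yields. Calling self.initialized() without returning its result throws away the start_urls requests that InitSpider has been holding back, so the spider simply stops after the login check. As a rough sketch, the whole callback might read as follows with that fix applied, assuming the same Python 2 / scrapy.contrib-era API used in the question:

def check_login_response(self, response):
    """Callback of the last initialization request: verify the login
    and, if it worked, hand control back to InitSpider."""
    print '=======================CHECK LOGIN======================='
    if "Logoff" in response.body:
        print "=========Successfully logged in.========="
        # initialized() returns the postponed requests built from
        # start_urls; they must be returned so the engine schedules them.
        return self.initialized()
    else:
        print "==============Bad times :(==============="
        # Returning nothing here means the crawl ends without ever
        # visiting the start_urls.

Note that the other callbacks in the question (init_request, login and login_cookies) all return their next Request, which is why those steps run; check_login_response is the only one that drops its result.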
