Scrapy not calling any other function after "__init__"


Question

OS: Ubuntu 16.04; Stack: Scrapy 1.0.3 + Selenium. I'm pretty new to Scrapy and this might sound very basic, but in my spider, only "__init__" is being executed. Any code/function after that is not getting called and the spider just halts.

import time

import scrapy
from selenium import webdriver

class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self,*args,**kwargs):
        self.browser=webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self,response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()


The spider gets the browser object, prints "Going to sleep" and just halts. It doesn't go inside the parse function.


Following are the contents of the run logs:

----------------inside init----------------
----------------Going to sleep------------------

Answer


There are a few problems you need to address or be aware of:


  1. You're not calling super() during the __init__ method, so none of the inherited class's initialization happens. Scrapy won't do anything (like calling its parse() method), because all of that is set up in scrapy.Spider.
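The effect of the missing super() call can be sketched with plain stand-in classes (hypothetical, not the real scrapy.Spider internals):

```python
# Hypothetical stand-ins showing why the missing super() call matters:
# scrapy.Spider does its own setup in __init__, and that setup never
# runs if a subclass overrides __init__ without calling super().
class Spider(object):
    def __init__(self, *args, **kwargs):
        self.wired_up = True  # stands in for Scrapy's internal setup

class BrokenSpider(Spider):
    def __init__(self, *args, **kwargs):
        pass  # parent __init__ is skipped entirely

class FixedSpider(Spider):
    def __init__(self, *args, **kwargs):
        super(FixedSpider, self).__init__(*args, **kwargs)  # the fix

print(hasattr(BrokenSpider(), "wired_up"))  # False
print(hasattr(FixedSpider(), "wired_up"))   # True
```

In the question's spider, the equivalent fix is adding `super(CancerForumSpider, self).__init__(*args, **kwargs)` as the first line of `__init__`.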


After fixing the above, your parse() method will be called by Scrapy, but it won't be operating on your Selenium-fetched webpage. It will have no knowledge of that page whatsoever, and will re-fetch the URL (based on start_urls). It's very likely that these two sources will differ (often drastically).


You're going to be bypassing almost all of Scrapy's functionality using Selenium the way you are. All of Selenium's get() calls will be executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.), nor will any of the expected/created objects (like request and response) be populated with the data you expect.


Before you fix all of that, you should consider a couple of better options/alternatives:

  • Create a downloader middleware that handles all "Selenium"-related functionality. Have it intercept request objects right before they hit the downloader, populate new response objects, and return them for processing by the spider.
    This isn't optimal, as you're effectively creating your own downloader and short-circuiting Scrapy's. You'll have to re-implement the handling of any desired settings the downloader usually takes into account and make them work with Selenium.
  • Ditch Selenium and use the Splash HTTP API and scrapy-splash middleware for handling JavaScript.
  • Ditch Scrapy altogether and just use Selenium and BeautifulSoup.

