Using InitSpider with splash: only parsing the login page?


Question

This is a follow-up to a question I asked earlier.

I'm trying to scrape a webpage which I have to login to reach first. But after authentication, the webpage I need requires a little bit of Javascript to be run before you can view the content. What I've done is followed the instructions here to install splash to try to render the Javascript. However...

Before I switched to splash, the authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping the target page OK (except without the Javascript working, obviously). But once I add the code to pass the requests through splash, it looks like I'm not parsing the target page.

Spider below. The only difference between the splash version (here) and the non-splash version is the function def start_requests(). Everything else is the same between the two.

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
            ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F" 

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through splash so the JS renders 
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            }) 

    # what to do when a link is encountered
    rules = (
            Rule(LinkExtractor(), callback='parse_item'),
            )

    # do nothing on new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html' 
        with open(filename, 'wb') as f:
            f.write(response.body)

What's happening now is that test.html, the result of parse(), is now simply the login page itself rather than the page I'm supposed to be redirected to after login.

This is telling in the log--ordinarily, I would see the "Login successful" line from check_login_response(), but as you can see below it seems like I'm not even getting to that step. Is this because scrapy is now putting the authentication requests through splash too, and that it's getting hung up there? If that's the case, is there any way to bypass splash only for the authentication part?
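As a sketch of that last idea (not from the original post): with the splash middleware from the linked setup instructions, only requests whose `meta` carries a `'splash'` key are rewritten to go through the render endpoint, while plain requests go straight to the site. So bypassing splash for the authentication step comes down to attaching that meta conditionally. The `splash_meta` helper below is hypothetical, just to make the shape of the meta dict concrete:

```python
def splash_meta(render=True, wait=0.5):
    """Build the request meta dict.

    An empty dict means the request skips splash entirely (e.g. the
    login POST); a dict with the 'splash' key is routed through the
    render.html endpoint by the splash middleware.
    """
    if not render:
        return {}
    return {'splash': {'endpoint': 'render.html', 'args': {'wait': wait}}}

print(splash_meta(False))  # {} -> login request goes direct
print(splash_meta(True))   # routed through render.html after login
```

Whether the splash process then sees the session cookies set during login is a separate issue, which is what the accepted answer works around.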

2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)

I'm pretty sure I'm not using splash correctly. Can anyone point me to some documentation where I can figure out what's going on?

Answer

I don't think Splash alone would handle this particular case well.

Here is the idea that worked:

  • use selenium and PhantomJS headless browser to log into the website
  • pass the browser cookies from PhantomJS into Scrapy

The code:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")

        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.body)

Prints Login successful and the HTML of the "hands" page.
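One detail worth noting when wiring the two pieces together: selenium's `get_cookies()` returns dicts that carry extra keys such as `'secure'`, `'httpOnly'`, and `'expiry'` alongside `'name'` and `'value'`. If Scrapy's `cookies=` argument trips over the extras, reducing them to a plain name-to-value mapping is a safe form to pass. The helper below is a sketch, not part of the original answer, and the cookie values are hypothetical:

```python
def to_scrapy_cookies(selenium_cookies):
    """Reduce selenium cookie dicts to the plain name -> value mapping
    that scrapy.Request(cookies=...) accepts."""
    return {c['name']: c['value'] for c in selenium_cookies}

# the shape driver.get_cookies() returns (values are made up):
raw = [{'name': 'PHPSESSID', 'value': 'abc123',
        'domain': '.bridgebase.com', 'path': '/',
        'secure': False, 'httpOnly': True}]

print(to_scrapy_cookies(raw))  # {'PHPSESSID': 'abc123'}
```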
