Crawling LinkedIn while authenticated with Scrapy


Question

I've read through "Crawling with an authenticated session in Scrapy" and I'm getting hung up. I'm 99% sure that my parse code is correct; I just don't believe the login is redirecting and succeeding.

I'm also having an issue with check_login_response(): I'm not sure which page it is checking, though "Sign Out" would make sense.
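On which page gets checked: check_login_response() receives whatever page the site returns for the submitted login form (by default Scrapy follows HTTP redirects before calling the callback), so looking for text that only appears when signed in, such as "Sign Out", is a reasonable test. One caveat, assuming a current Scrapy release on Python 3: response.body is bytes there, so the marker has to be compared against decoded text (response.text). A minimal standard-library sketch of that check follows; the helper name looks_logged_in and the sample pages are hypothetical and not part of the original spider.

```python
def looks_logged_in(body: bytes, marker: str = "Sign Out", encoding: str = "utf-8") -> bool:
    """Return True if the marker text appears in the decoded page body."""
    # Decode first: on Python 3, testing a str against bytes raises TypeError.
    text = body.decode(encoding, errors="replace")
    return marker in text


# Hypothetical page bodies, for illustration only.
logged_in_page = b"<html><a href='/logout'>Sign Out</a></html>"
login_page = b"<html><form id='login'>Sign In</form></html>"

print(looks_logged_in(logged_in_page))
print(looks_logged_in(login_page))
```

Inside a spider callback the same idea would read: if "Sign Out" in response.text.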







====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..

            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items





The issue was resolved by adding 'return' in front of self.initialized().
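Why the missing 'return' matters, as far as I understand InitSpider: the spider holds back the normal start requests until initialization finishes, and initialized() hands them back to the caller. If the callback discards that value, nothing is ever scheduled and parse() is never reached. The sketch below is a simplified model of that behaviour, not Scrapy's actual code; MiniInitSpider and run are hypothetical names.

```python
class MiniInitSpider:
    """Simplified model of an init-then-crawl spider (not Scrapy's real code)."""

    start_urls = ["page-1", "page-2"]

    def __init__(self, use_return: bool):
        self.use_return = use_return

    def start_requests(self):
        # Hold back the real start requests until initialization is done.
        self._postponed = list(self.start_urls)
        return self.check_login_response("login ok: Sign Out")

    def initialized(self):
        # Hand the postponed requests back to the caller.
        return self._postponed

    def check_login_response(self, response_text):
        if "Sign Out" in response_text:
            if self.use_return:
                return self.initialized()
            self.initialized()  # result is discarded, nothing gets scheduled
        return None


def run(spider):
    """Stand-in for the engine: it only schedules what callbacks return."""
    return list(spider.start_requests() or [])


print(run(MiniInitSpider(use_return=True)))
print(run(MiniInitSpider(use_return=False)))
```

With use_return=True the postponed pages come back to the caller; with use_return=False the list is empty, which matches the symptom of a login that appears to succeed but never starts crawling.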

Thanks again! -Mark

Recommended answer

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you don't understand how to define the rules for extracting links, read through the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
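Conceptually, a rule pairs a link extractor (which links on a page to follow, usually selected with an allow pattern) with a callback for the pages those links lead to. The sketch below illustrates only the link-selection half using the standard library. It is not Scrapy's Rule or link extractor API; LinkCollector, extract_links, the sample page, and the pattern are all hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, allow):
    """Return links whose URL matches the `allow` regular expression."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    pattern = re.compile(allow)
    return [link for link in collector.links if pattern.search(link)]


# Hypothetical page and pattern, for illustration only.
page = '<a href="/company/1">A</a> <a href="/jobs/9">B</a> <a href="/company/2">C</a>'
company_links = extract_links(page, "http://www.linkedin.com/", allow=r"/company/\d+")
print(company_links)
```

In a real spider the framework would request each selected link and pass the response to the rule's callback, which is why that callback should have a name other than parse.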
