Scrapy start Crawling after login


Problem Description

Disclaimer: The site I am crawling is a corporate intranet and I modified the url a bit for corporate privacy.

I managed to log into the site but I have failed to crawl the site.

Start from start_url https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this site redirects you to a similar page with a more complex URL, i.e.

https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278})

For every page, including the start_url, I want to crawl all hrefs found under //li/<a>. Every page crawled contains an abundant number of hyperlinks, and some of them will be duplicates because you can access both the parent and child sites from the same page.

As you may see, the href does not match the actual link (the link quoted above) that we see when we crawl into that page. There is also a # in front of its useful content. Could that be the source of the problem?

For restrict_xpaths, I have restricted the path to the 'logout' link on the page.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy

class kmssSpider(CrawlSpider):
    name='kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.sarg"]

    rules = (Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*'),
                                restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                                unique=True),
                  callback='parse_item', follow=True),
             )
#    r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
#    restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),unique = True)

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)
    def login(self,response):
        return FormRequest.from_response(response,formdata={'user':'user','password':'pw'},
                                        callback = self.check_login_response)

    def check_login_response(self,response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n " )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        pass

Log:

2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines: 
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG: 



 Successfuly Logged in 



2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 31259,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
 'log_count/DEBUG': 8,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)

  [1]: http://i.stack.imgur.com/REQXJ.png

----------------------------------UPDATED---------------------------------------

I saw the cookies format in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site (see the screenshot at [1]), but I am not sure which of them I should add along with the Request, or how.

Solution

First of all, do not be demanding; sometimes I get angry and won't answer your question.

To see which cookies are sent with your Request, enable debugging with COOKIES_DEBUG = True.
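
For example, this can be switched on in the project's settings.py (COOKIES_DEBUG and COOKIES_ENABLED are standard Scrapy settings; the exact file location depends on your project layout):

# settings.py -- log the Cookie / Set-Cookie headers of every request and response
COOKIES_DEBUG = True

# the cookies middleware only runs while this stays at its default of True
COOKIES_ENABLED = True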

Then you will notice that the cookies are not sent, even though Scrapy's middleware should send them. I think this is because you yield a custom Request, and Scrapy won't try to be cleverer than you: it accepts your decision and sends this request without cookies.

This means you need to access the cookies from the response and add the required ones (or all) to your Request.
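
A minimal sketch of that idea, reworking the check_login_response method from the question (how the session cookies are delivered in the Set-Cookie headers is my assumption here, so adapt the parsing to what you actually see in the debug log):

from Cookie import SimpleCookie  # at the top of the module; http.cookies on Python 3

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("Successfully logged in")
            # collect the session cookies that the login response just set
            cookies = {}
            for header in response.headers.getlist('Set-Cookie'):
                for name, morsel in SimpleCookie(header).items():
                    cookies[name] = morsel.value
            # attach them explicitly so the follow-up request keeps the session
            yield Request(url=self.start_url[0], cookies=cookies, dont_filter=True)
        else:
            self.log("You are not logged in")

With COOKIES_DEBUG enabled you can then check in the log whether the Cookie header actually appears on the request for start_url[0].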
