How to use CrawlSpider from Scrapy to click a link with a JavaScript onclick?


Question

I want Scrapy to crawl pages where the link to the next page looks like this:

<a href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret JavaScript code like that?

Using the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" in the body, starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:

from scrapy.http import FormRequest

def logon(self, response):
    # POST the sign-in form (the first form on the page) and hand the response to submit_next()
    login_form_data = {'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]

And then I defined submit_next() to tell it what to do next. I can't figure out how to tell CrawlSpider which method to use on the first URL.

All requests in my crawling, except the first one, are POST requests. They alternate between two types of requests: pasting some data, and clicking "Next" to go to the next page.

Answer

The actual approach would be as follows (a minimal sketch follows the steps):

  1. Post your request to reach the landing page (as you are already doing)

  2. Extract the link to the next page from that response

  3. If possible, request the next page with a simple Request, or use FormRequest again where applicable
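
Below is a minimal sketch of those three steps, on the assumption that the "Next" link belongs to a <form> carrying the hidden encoded_session_hidden_map field; the spider name, start URL and extraction logic are placeholders, not taken from the question:

import scrapy
from scrapy.http import FormRequest

class PagedSpider(scrapy.Spider):
    name = 'paged'                             # hypothetical name
    start_urls = ['http://example.com/login']  # hypothetical URL

    # Step 1: POST the sign-in form to reach the first page of results
    def parse(self, response):
        return FormRequest.from_response(
            response, formnumber=0,
            formdata={'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in'},
            callback=self.parse_page)

    # Steps 2 and 3: scrape the page, then re-submit the pagination form;
    # from_response() copies hidden fields such as encoded_session_hidden_map
    # into the POST body automatically
    def parse_page(self, response):
        # ... extract items from the current page here ...
        if response.xpath('//a[contains(@onclick, "gotoPage")]'):
            yield FormRequest.from_response(
                response, formnumber=0,
                # the field that gotoPage() sets may need to be added to
                # formdata by hand; see the notes below on dont_click
                dont_click=True,
                callback=self.parse_page)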

All of this has to be streamlined with the server's response mechanism, e.g.:


  • You can try using dont_click=True in FormRequest.from_response

  • Or you may want to handle the redirection (302) coming from the server, in which case you will have to indicate in the request meta that you want the redirected response delivered to your callback as well (both options are sketched below)
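
As a sketch, the two options could look roughly like this inside a spider callback (the callback names are placeholders; dont_click, dont_redirect and handle_httpstatus_list are standard Scrapy options):

from scrapy.http import FormRequest

# Option 1: submit the form values as-is, without simulating a click
# on any submit button
request = FormRequest.from_response(response, formnumber=0,
                                    dont_click=True,
                                    callback=self.parse_page)

# Option 2: stop the 302 from being followed automatically and have the
# redirect response delivered to the callback instead
request = FormRequest.from_response(
    response, formnumber=0,
    meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
    callback=self.parse_redirect)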

Now, how to figure it all out: use a web debugger like Fiddler, the Firefox plugin FireBug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the way you are crawling the webpage.
