Use RCurl to bypass "disclaimer page" then do the web scraping


Question


I have a link like this one that I would like to extract data from using RCurl. There is a disclaimer page before it that I need to click in my browser before I can access the data. Previously I used the script below, which is from here, to "bypass" the disclaimer page and access the data using RCurl:

 library(RCurl)
 library(XML)

 # Point cURL at a cookie file that does not exist ("nosuchfile") so the
 # session starts with no cookies, and follow any redirects the server issues
 pagesource <- getURL(url, .opts = curlOptions(followlocation = TRUE, cookiefile = "nosuchfile"))
 doc <- htmlParse(pagesource)


It worked before, but in the last few days it no longer works. Actually I don't have much idea of what the code is doing; I wonder if I have to change something in the curlOptions, or rewrite the whole piece of code?

Thanks.

Answer


As I mention in my comment, the solution to your problem will totally depend on the implementation of the "disclaimer page." It looks like the previous solution used cURL options defined in more detail here. Basically, what it's instructing cURL to do is to provide a fake cookie file (named "nosuchfile") and then follow the header redirect given by the site you were trying to access. Apparently that site was set up in such a way that if a visitor claimed not to have the proper cookies, then it would immediately redirect the visitor past the disclaimer page.


You didn't happen to create a file named "nosuchfile" in your working directory, did you? If not, it sounds like the target site changed the way its disclaimer page operates. If that's the case, there's really no help we can provide unless we have the actual page you're trying to access to diagnose.
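To see for yourself what the site is doing now, a minimal diagnostic sketch (assuming the RCurl package is installed and `url` holds the page address) is to capture the response headers without following redirects, and look for a `Location` redirect or `Set-Cookie` header:

```r
library(RCurl)

# Collect the raw response headers so we can inspect any redirect or
# Set-Cookie behavior the server now uses
h <- basicHeaderGatherer()
pagesource <- getURL(url,
                     headerfunction = h$update,
                     .opts = curlOptions(followlocation = FALSE))
h$value()  # status code plus headers such as Location and Set-Cookie
```

If the status is a 3xx redirect to the disclaimer page, the old cookie trick is no longer enough on its own.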


In the example you reference in your question, they're using Javascript to move past the disclaimer, which could be tricky to get past.


For the example you mention, however...

  1. Open it in Chrome (or Firefox with Firebug)
  2. Right-click some empty space in the page and select "Inspect Element"
  3. Click the "Network" tab
  4. If there's any content there, click the "Clear" button at the bottom to empty the pane.
  5. Accept the license agreement
  6. Watch all the traffic that comes over the network. In my case, one result was the interesting one. If you click it, you can preview it to verify that it really is an HTML document. If you click the "Headers" tab under that item, it will show the "Request URL". In my case, that was: http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?keyword=U42360&data_selection=0&start_day=30&start_month=03&start_year=2012&end_day=18&end_month=04&end_year=2012&data_selection2=0


You can access that URL directly without having to accept any license agreement, either by hand or from cURL.
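As a sketch (assuming the request URL found in the Network tab above, and the RCurl and XML packages), fetching it from R looks like:

```r
library(RCurl)
library(XML)

# The request URL discovered in the browser's Network tab; no disclaimer
# click is needed to request it directly
dataurl <- "http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?keyword=U42360&data_selection=0&start_day=30&start_month=03&start_year=2012&end_day=18&end_month=04&end_year=2012&data_selection2=0"

pagesource <- getURL(dataurl)
doc <- htmlParse(pagesource)   # parse the returned HTML
tables <- readHTMLTable(doc)   # extract any data tables it contains
```

From there you can work with `tables` as ordinary data frames.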


Note that if you've already accepted the agreement, this site stores a cookie stating such which will need to be deleted in order to get back to the license agreement page. You can do this by clicking the "Resources" tab, then going to "Cookies" and deleting each one, then refreshing the URL you posted above.
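If the site now requires a genuine "agreement accepted" cookie before serving data, one approach (a sketch only; the form URL and field name below are assumptions you would need to read from the disclaimer page's own HTML) is to accept the agreement from R on a shared handle so the cookie persists:

```r
library(RCurl)

# Share one curl handle so cookies persist across requests; an empty
# cookiefile string turns on cURL's in-memory cookie engine
curl <- getCurlHandle(cookiefile = "", followlocation = TRUE)

# Hypothetical form submission: inspect the real disclaimer page for the
# actual action URL and field names
postForm("http://example.com/disclaimer", accept = "yes", curl = curl)

# Subsequent requests on the same handle carry the acceptance cookie
pagesource <- getURL(url, curl = curl)
```

This only works for plain HTML forms; if the disclaimer is driven by JavaScript, as in the other example referenced above, cURL alone cannot click it.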

