Automating the login to the UK Data Service website in R with RCurl or httr

Question

I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK data service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and importation of this survey data. In order to do that, I need to figure out how to programmatically log into this UK data service website.

I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.

This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.

The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp

This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]

(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply ignoring the SSL:

library(httr)
set_config( config( ssl.verifypeer = 0L ) )

and then my first command on the starting website is:

z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )

This gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser is also redirected to.

In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword

I think this is where I'm stuck because I cannot figure out how to have cURL followlocation and land on this website. Note: no username/password has been entered yet.

When I use the httr GET command from the wayf.ukfederation.org.uk page like this:

y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )

the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?

I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?

(2) Once I do make it through to that page, I believe the remainder of the code would just be:

values <- list( j_username = "your.username" , 
                j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)

but that will have to wait...

Recommended answer

The relevant data variables returned by the form are action and origin, not combobox. Give action the value "selection" and give origin the value of the relevant entry in the combobox:

y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
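
To make it easier to see what those two form fields turn into on the wire, here is a minimal base-R sketch that appends them to a URL as a percent-encoded query string, which is roughly what httr's `query =` argument does. The helper `build_query_url` and the `wayf.example.org` base URL are hypothetical placeholders, not part of httr or the real site; in practice the base would be the long `z$url`.

```r
# Hypothetical helper (not part of httr): append named form fields to a URL
# as a percent-encoded query string, roughly mimicking `query =` in GET().
build_query_url <- function(base_url, fields) {
  encoded <- vapply(
    names(fields),
    function(nm) paste0(nm, "=", utils::URLencode(fields[[nm]], reserved = TRUE)),
    character(1)
  )
  # Use "&" if the URL already carries a query string, as z$url does.
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, paste(encoded, collapse = "&"))
}

# The two fields named in the answer.
wayf_fields <- list(
  action = "selection",
  origin = "https://shib.data-archive.ac.uk/shibboleth-idp"
)

# Placeholder base URL (assumption); the real one is the redirected z$url.
example_url <- build_query_url("https://wayf.example.org/DS002/uk.ds", wayf_fields)
print(example_url)
```

Printing the URL this way is a quick check that the request carries `action` and `origin` rather than `combobox`, which was the difference between the failing and the working call.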

Edit

It looks as though the handle pool isn't keeping your session alive correctly. You therefore need to pass the handles explicitly rather than relying on httr to reuse them automatically. Also, for the POST command you need to set multipart=FALSE, as URL-encoded (non-multipart) bodies are the default for HTML forms. httr's POST has a different default because it is mainly designed for uploading files. So:

y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
  Status: 200
  Content-type: text/html

...snipped...    


                <title>

                        Introduction to ESDS

                </title>

                <meta name="description" content="Introduction to the ESDS, home page" />
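
Putting the three requests together, the whole flow could be wrapped up along the following lines. This is only a sketch under several assumptions: the function name `ukds_login` is made up for illustration, the URLs and field names are the ones quoted in this thread and may no longer be valid, and `encode = "form"` is used as the equivalent of `multipart = FALSE` in more recent httr releases. The SSL workaround from the question is not included and would need to be applied separately if it is still required.

```r
# Hypothetical wrapper around the three requests discussed above.
# Assumes httr is installed; nothing is sent until the function is called.
ukds_login <- function(username, password) {
  # Step 1: hit the start page and follow the redirect to the WAYF page.
  start <- httr::GET("https://www.esds.ac.uk/secure/UKDSRegister_start.asp")

  # Step 2: submit the WAYF form, reusing the same handle so the
  # session cookies set in step 1 are kept.
  idp <- httr::GET(
    handle = start$handle,
    query = list(
      action = "selection",
      origin = "https://shib.data-archive.ac.uk/shibboleth-idp"
    )
  )

  # Step 3: post the credentials as a URL-encoded form on the same session.
  # encode = "form" is assumed here in place of multipart = FALSE.
  httr::POST(
    handle = idp$handle,
    body = list(j_username = username, j_password = password),
    encode = "form"
  )
}
```

A call such as `resp <- ukds_login("your.username", "your.password")` would then return the final response object, whose `url` and status can be inspected to confirm whether the login went through.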
