Automating the login to the UK Data Service website in R with RCurl or httr

Question

I am writing a collection of freely downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK Data Service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and import of the survey data. To do that, I need to figure out how to log into the UK Data Service website programmatically.

I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the pages jump around too fast in the browser for me to understand what's going on.

This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.

The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp

This page automatically redirects your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]

(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply to ignore SSL verification:

library(httr)
set_config( config( ssl.verifypeer = 0L ) )

My first command on the starting page is then:

z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )

This gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also redirects to.

In the browser, you are then supposed to type in "uk data archive" and click the continue button. When I do that, it redirects me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword

I think this is where I'm stuck, because I cannot figure out how to get cURL's followlocation to land on this page. Note: no username/password has been entered yet.

When I use the httr GET command from the wayf.ukfederation.org.uk page like this:

y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )

the y$url string looks a lot like z$url (except that it has a combobox= on the end). Is there any way to get through to this UK Data Archive authentication page with RCurl or httr?

I can't tell whether I'm just overlooking something, or whether I absolutely must use the SSL certificate described in my previous SO post.

(2) Once I do make it through to that page, I believe the remainder of the code would just be:

values <- list( j_username = "your.username" , 
                j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)

But I guess that page will have to wait...

Recommended answer

The relevant data variables returned by the form are action and origin, not combobox. Give action the value "selection", and give origin the value from the relevant entry in combobox:

y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
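
One way to find out which variables a form actually submits is to list the name attributes of its input and select elements. The sketch below does this in base R on a made-up HTML fragment; the markup is an assumption for illustration only, not the real source of the WAYF page. With httr, the real page text could be read from the response with content(z, "text") and passed through the same steps.

```r
# Illustrative HTML only: an invented fragment, NOT the real WAYF page markup.
html <- '<form action="/DS002/uk.ds" method="get">
  <input type="hidden" name="action" value="selection" />
  <select name="origin" id="combobox">
    <option value="https://shib.data-archive.ac.uk/shibboleth-idp">UK Data Archive</option>
  </select>
  <input type="submit" value="Continue" />
</form>'

# Extract every <input ...> and <select ...> opening tag.
tags <- regmatches(html, gregexpr("<(input|select)[^>]*>", html))[[1]]

# Keep only tags that carry a name attribute; only those are submitted.
has_name <- grepl('name="', tags, fixed = TRUE)

# Read the value of the name attribute from each remaining tag.
field_names <- sub('.*name="([^"]*)".*', "\\1", tags[has_name])
print(field_names)
```

In this invented fragment the submitted fields are action and origin, while combobox appears only as an id, which is the kind of mismatch described above.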

Edit

It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handles explicitly rather than letting httr pick them automatically. Also, for the POST command you need to set multipart=FALSE, since that matches the default encoding of HTML forms. The R command has a different default because it is mainly designed for uploading files. So:

y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
  Status: 200
  Content-type: text/html

...snipped...    


                <title>

                        Introduction to ESDS

                </title>

                <meta name="description" content="Introduction to the ESDS, home page" />
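
In later httr releases the multipart argument was, as far as I know, superseded by encode, so the form-encoded POST above might be written as in the sketch below. The helper name submit_login and its parameters are hypothetical, introduced only for illustration; the field names j_username and j_password are taken from the question. This has not been checked against the live site, and the session handling discussed above may still be needed.

```r
# Hypothetical helper, not part of httr: sends the credentials as a regular
# HTML form submission (application/x-www-form-urlencoded).
submit_login <- function(login_url, username, password) {
  httr::POST(
    url = login_url,
    body = list(j_username = username, j_password = password),
    encode = "form"  # counterpart of multipart = FALSE in older httr versions
  )
}
```

It would be called along the lines of submit_login("https://shib.data-archive.ac.uk/idp/Authn/UserPassword", "your.username", "your.password"); defining the function does not itself contact the server.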
