How to log in and then download a file from ASPX web pages with R


Problem description

I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user to this login/authentication page. After authentication, it's easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the headers in Chrome for the Login.aspx page (as described here: http://stackoverflow.com/questions/10213194/use-rcurl-to-bypass-disclaimer-page-then-do-the-web-scrapping), but the authentication still isn't maintained, even when I believe I'm passing in all the correct values. I don't care whether it's done with httr, RCurl or something else; I'd just like something that works inside R, so that users of this script don't have to download the files manually or with some completely separate program. One of my attempts is below, but it doesn't work. Any help would be appreciated. Thanks!! :D

require(httr)

values <- 
    list( 
        "ctl00$ContentPlaceHolder3$Login1$UserName" = "you@email.com" , 
        "ctl00$ContentPlaceHolder3$Login1$Password" = "somepassword" ,
        "ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In" ,
        "_LASTFOCUS" = "" ,
        "_EVENTTARGET" = "" ,
        "_EVENTARGUMENT" = "" 
    )

POST( "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx" , body = values )

resp <- GET( "http://simba.isr.umich.edu/Zips/GetFile.aspx" , query = list( file = "1053" ) )

Solution

Besides storing the cookie after authentication (see my comment above), there was another problem in your solution: the ASP.NET site sets a __VIEWSTATE key-value pair, a hidden field in the login form, which has to be preserved in your requests. If you check, you could not even log in with your example (the result of the POST command holds info about how to log in, just check it out).
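To make the VIEWSTATE point concrete, here is a minimal sketch in base R of pulling such a value out of page HTML. The HTML fragment, its field value and the helper name `get_hidden_field` are made up for illustration; the real login page will contain a different (and much longer) value.

```r
# Made-up fragment imitating an ASP.NET login form;
# the value below is a placeholder, not real page content.
sample_html <- paste0(
  '<form method="post" action="Login.aspx">',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />',
  '<input type="text" name="ctl00$ContentPlaceHolder3$Login1$UserName" />',
  '</form>'
)

# Hypothetical helper: return the value attribute of the field with the given id,
# or NA if no such field is found.
get_hidden_field <- function(html, id) {
  pattern <- paste0('id="', id, '" value="([^"]*)"')
  hit <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(hit) < 2) NA_character_ else hit[2]
}

viewstate <- get_hidden_field(sample_html, "__VIEWSTATE")
print(viewstate)
```

Returning NA for a missing field makes a failed extraction easy to spot, whereas `sub()` hands back its whole input unchanged when the pattern does not match.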

An outline of a possible solution:

  1. Load RCurl package:

    > library(RCurl)
    

  2. Set some handy curl options:

    > curl = getCurlHandle()
    > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
    

  3. Load the page for the first time to capture VIEWSTATE:

    > html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
    

  4. Extract VIEWSTATE with a regular expression or any other tool:

    > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
    

  5. Set the parameters as your username, password and the VIEWSTATE:

    > params <- list(
        'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
        'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
        'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
        '__VIEWSTATE'                                  = viewstate
        )
    

  6. Log in at last:

    > html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
    

    Congrats, now you are logged in and curl holds the cookie verifying that!

  7. Verify if you are logged in:

    > grepl('Logout', html)
    [1] TRUE
    

  8. So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
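Step 8 could look roughly like the sketch below. It assumes the `curl` handle from step 2 has already been used for a successful login in step 6, and it reuses the GetFile.aspx URL and `file` parameter from the question. The helper name `download_with_handle` and the destination file name are hypothetical.

```r
# Hypothetical helper: download one file through an already-authenticated handle.
# `curl` is assumed to be the handle created in step 2 and logged in at step 6.
download_with_handle <- function(file_id, dest, curl) {
  # URL and query parameter taken from the question's GET() call
  url <- paste0("http://simba.isr.umich.edu/Zips/GetFile.aspx?file=", file_id)
  # getBinaryURL() returns the response body as a raw vector
  bytes <- RCurl::getBinaryURL(url, curl = curl)
  # Write the raw bytes to disk unchanged
  writeBin(bytes, dest)
  invisible(dest)
}

# Example call (not run here: it needs a logged-in handle and network access);
# the destination file name is made up:
# download_with_handle("1053", "psid_1053.zip", curl = curl)
```

Passing the same handle is what carries the session cookie along; a fresh handle would be sent back to the login page.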
