How to login and then download a file from aspx web pages with R
Problem description
I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user through to this login/authentication page. After authentication, it's easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the Headers in Chrome for the Login.aspx page (as described at http://stackoverflow.com/questions/10213194/use-rcurl-to-bypass-disclaimer-page-then-do-the-web-scrapping), but the authentication still is not maintained, even though I believe I'm passing in all the correct values. I don't care whether it's done with httr, RCurl, or something else; I'd just like something that works inside R, so that users of this script don't have to download the files manually or with some completely separate program. One of my attempts is below, but it doesn't work. Any help would be appreciated. Thanks!! :D
    require(httr)

    # form fields as seen in the Chrome headers for Login.aspx
    values <- list(
        "ctl00$ContentPlaceHolder3$Login1$UserName"    = "you@email.com",
        "ctl00$ContentPlaceHolder3$Login1$Password"    = "somepassword",
        "ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In",
        "__LASTFOCUS"     = "",
        "__EVENTTARGET"   = "",
        "__EVENTARGUMENT" = ""
    )

    # attempt to log in, then request the file
    POST("http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx", body = values)
    resp <- GET("http://simba.isr.umich.edu/Zips/GetFile.aspx", query = list(file = "1053"))
Besides storing the cookie after authentication (see my comment above), there was another problem in your solution: the ASP.NET site sets a __VIEWSTATE key-value pair, a hidden field in the login form, which has to be sent back with your queries. If you check, you could not even log in with your example: the result of the POST command holds info about how to log in, just check it out.
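To make the __VIEWSTATE point concrete, here is a minimal base R sketch of pulling a hidden field out of a page's HTML. The helper name `extract_hidden_field` and the sample markup are made up for illustration; the sketch assumes the field is a hidden input whose `id` attribute comes before its `value` attribute, which may not hold for every page.

```r
# Hypothetical helper (not part of RCurl or httr): return the value of a
# hidden <input> field such as __VIEWSTATE, or NA if it is not found.
# `html` is a single string; `field` is used verbatim inside the regex.
extract_hidden_field <- function(html, field) {
  # Find the id attribute, skip to the value attribute within the same tag,
  # and capture everything up to the closing quote.
  pattern <- paste0('id="', field, '"[^>]*value="([^"]*)"')
  hit <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(hit) < 2) return(NA_character_)  # field not present
  hit[2]
}

# Made-up sample markup, only shaped like an ASP.NET hidden field.
sample_html <- '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4Nzcz+/=" />'

extract_hidden_field(sample_html, "__VIEWSTATE")
extract_hidden_field(sample_html, "__EVENTVALIDATION")
```

Taking the field name as an argument means the same helper could be reused if the login form turns out to require other hidden fields as well.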
An outline of a possible solution:

1. Load the RCurl package:

    > library(RCurl)

2. Set some handy curl options:

    > curl = getCurlHandle()
    > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)

3. Load the page for the first time to capture VIEWSTATE:

    > html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)

4. Extract VIEWSTATE with a regular expression or any other tool:

    > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))

5. Set the parameters as your username, password and the VIEWSTATE:

    > params <- list(
    +     'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
    +     'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
    +     'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
    +     '__VIEWSTATE'                                  = viewstate
    + )

6. Log in at last:

    > html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)

   Congrats, now you are logged in and curl holds the cookie verifying that!

7. Verify that you are logged in:

    > grepl('Logout', html)
    [1] TRUE

8. So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
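As a rough sketch of that last step, the download itself might look like the following. The helper names `build_file_url` and `download_with_session` and the destination file name are hypothetical; the URL and the file id 1053 come from the question. It assumes RCurl is installed and that `curl` is the handle that went through the login steps above.

```r
# Hypothetical helper: build the download URL used in the question.
build_file_url <- function(file_id) {
  paste0("http://simba.isr.umich.edu/Zips/GetFile.aspx?file=", file_id)
}

# Hypothetical helper: fetch a file as raw bytes through the authenticated
# handle and write it to disk. Passing `curl` is what sends the login cookie.
download_with_session <- function(file_id, dest, curl) {
  bytes <- RCurl::getBinaryURL(build_file_url(file_id), curl = curl)
  writeBin(bytes, dest)
  invisible(dest)
}

# Usage, once `curl` holds the login cookie (destination name is an assumption):
# download_with_session(1053, "psid_1053.zip", curl)
```

Fetching with getBinaryURL rather than getURL keeps the response as raw bytes, which matters if the server returns an archive rather than text.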