Web locked CSV to dataframe in R
I have a file on a private web server that I am trying to access. I must first go to a site and log in with my credentials; then I can type a URL (there is no link) to access the file, which immediately downloads a CSV file to the computer. I am trying to get that CSV file to load into R automatically, either directly from the web or by having it downloaded automatically and then read from my hard drive.
I am going to be refreshing this data 10-15 times a day, which is why I need it to be automatic rather than downloading it manually every time.
I have tried several packages and have been impressed with Hadley's package rvest, which has proven much easier than some things I have used in the past. I am succeeding in downloading the data:
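library(rvest)
# Start a session and follow the "Sign In" link to reach the login page
fs <- html_session("somewebsite.org")
fs.login <- fs %>% follow_link("Sign In")
# Fill in the first form on the login page and submit it
login.form <- html_form(fs.login)[[1]]
login.form <- set_values(login.form, userName = "myusername", password = "mypassword")
active.session <- submit_form(fs.login, login.form)
# Request the report URL with the authenticated session
my.data <- jump_to(active.session, "somewebsite.org/report/groups")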
I have run it with a timer several times and it takes an average of 27 seconds, which indicates it is downloading the file (roughly the same time it takes in Google Chrome). The result is a 43.7 Mb object of class session with 7 elements:
my.data
somewebsite/report/groups
Status: 200
Type: text/csv
Size: 45856046
My question is: how can I access the actual CSV file or data in R?
str(my.data)
List of 7
$ handle :List of 2
..$ handle:Class 'curl_handle' <externalptr>
..$ url : chr "somewebsite.org"
..- attr(*, "class")= chr "handle"
$ config :List of 7
..$ method : NULL
..$ url : NULL
..$ headers : NULL
..$ fields : NULL
..$ options :List of 1
.. ..$ autoreferer: int 1
..$ auth_token: NULL
..$ output : NULL
..- attr(*, "class")= chr "request"
$ url : chr "https://somewebsite.org/report/groups"
$ back : chr "https://somewebsite.org/report/groups"
$ forward : chr(0)
$ response:List of 10
..$ url : chr "https://somewebsite.org/report/groups"
..$ status_code: int 200
..$ headers :List of 6
.. ..$ content-disposition: chr "attachment; filename=\"groups-2016-0318-063749.csv\""
.. ..$ content-type : chr "text/csv"
.. ..$ date : chr "Fri, 18 Mar 2016 18:37:49 GMT"
.. ..$ server : chr "Apache-Coyote/1.1"
.. ..$ transfer-encoding : chr "chunked"
.. ..$ connection : chr "Close"
.. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
..$ all_headers:List of 1
.. ..$ :List of 3
.. .. ..$ status : int 200
.. .. ..$ version: chr "HTTP/1.1"
.. .. ..$ headers:List of 6
.. .. .. ..$ content-disposition: chr "attachment; filename=\"groups-2016-0318-063749.csv\""
.. .. .. ..$ content-type : chr "text/csv"
.. .. .. ..$ date : chr "Fri, 18 Mar 2016 18:37:49 GMT"
.. .. .. ..$ server : chr "Apache-Coyote/1.1"
.. .. .. ..$ transfer-encoding : chr "chunked"
.. .. .. ..$ connection : chr "Close"
.. .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
..$ cookies :'data.frame': 6 obs. of 7 variables:
.. ..$ domain : chr [1:6] "somewebsite.org" "#HttpOnly_.site.org" "signin.site.org" ".site.org" ...
.. ..$ flag : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
.. ..$ path : chr [1:6] "/" "/" "/" "/" ...
.. ..$ secure : logi [1:6] FALSE TRUE FALSE FALSE TRUE TRUE
.. ..$ expiration: POSIXct[1:6], format: "2017-03-18 12:37:16" NA NA NA ...
.. ..$ name : chr [1:6] "fs_experiments" "ObSSOCookie" "TS01289383" "TS01b89640" ...
.. ..$ value : chr [1:6] "u%3D-anon-%2Ca%3Dshared-ui%2Cs%3Dac76fc702b255a493a5856b5432b92b4%2Cv%3D0100110011010000000111111111001110101101000000000001100"| __truncated__ "15yUK2dU%2B78GK7o587gtwh3i%2ByORXGD8ne5XJBiGkiHpDAJ3%2F7rQ4Gql6T5DqQIwCg%2FSwSioAMIzzaRxGEFKsCkc%2BGohi1fdWhbR0urah6%2BJikm9lA6"| __truncated__ "01999b7023d69473f53740d0f7f2969d9d79e1a18c7e259f6baf643ce642a330fc0a3604d7" "01999b7023960237ab42ec3f429e5a452fe3559d683a090b19a65cf66ce0c01bc21bdb29bf78f030d36d4eeff4dec21ff185c54b06" ...
..$ content : raw [1:45857717] 69 64 2c 6e ...
..$ date : POSIXct[1:1], format: "2016-03-18 18:37:49"
..$ times : Named num [1:6] 0 0 0.062 0.156 27.425 ...
.. ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
..$ request :List of 7
.. ..$ method : chr "GET"
.. ..$ url : chr "https://somewebsite.org/report/groups"
.. ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. .. ..- attr(*, "names")= chr "Accept"
.. ..$ fields : NULL
.. ..$ options :List of 4
.. .. ..$ useragent : chr "libcurl/7.43.0 r-curl/0.9.6 httr/1.0.0"
.. .. ..$ cainfo : chr "C:/Users/Thisuser/Documents/R/win-library/3.2/httr/cacert.pem"
.. .. ..$ autoreferer : int 1
.. .. ..$ customrequest: chr "GET"
.. ..$ auth_token: NULL
.. ..$ output : list()
.. .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
.. ..- attr(*, "class")= chr "request"
..$ handle :Class 'curl_handle' <externalptr>
..- attr(*, "class")= chr "response"
$ html :<environment: 0x000000001aad2f60>
- attr(*, "class")= chr "session"
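The data are stored as a raw vector in the "content" element of the response, that is, in my.data$response$content (see the str() output above). read_csv from the "readr" package accepts a raw vector, so it should be able to read it directly.
Try the following:
library(httr)
library(readr)
read_csv(my.data$response$content)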
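As a minimal sketch of the idea, the snippet below builds a small stand-in for the session object (a hypothetical `mock.data` list with a raw vector at `response$content`, mirroring the `str()` output in the question) and turns those bytes into a data frame. It uses only base R so that it runs without a network connection or a login; the column names and values are made up for illustration. With the real session, `my.data$response$content` would take the place of `mock.data$response$content`.

```r
# Stand-in for the rvest session: a nested list whose response$content
# holds the CSV payload as raw bytes (hypothetical data for illustration).
csv.text <- "id,name\n1,alpha\n2,beta\n"
mock.data <- list(response = list(content = charToRaw(csv.text)))

# Option 1: decode the raw bytes to text and parse them in memory.
groups <- read.csv(text = rawToChar(mock.data$response$content),
                   stringsAsFactors = FALSE)

# Option 2: write the bytes to disk unchanged, then read the file back.
# This also leaves a copy of the download on the hard drive.
csv.path <- tempfile(fileext = ".csv")
writeBin(mock.data$response$content, csv.path)
groups.from.disk <- read.csv(csv.path, stringsAsFactors = FALSE)

print(groups)
```

Both options should give the same data frame. The second one may suit the "download and then read from my hard drive" variant of the question, since `writeBin()` saves the bytes exactly as the server sent them. For a payload of roughly 45 MB, passing the raw vector straight to `readr::read_csv()` avoids building an intermediate string.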