How to download a large binary file with RCurl *after* server authentication

Problem description

i originally asked this question about performing this task with the httr package, but i don't think it's possible using httr. so i've re-written my code to use RCurl instead -- but i'm still tripping up on something probably related to the writefunction.. but i really don't understand why.

you should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. i need a solution that downloads directly to the hard disk.

to start, this code works -- the zipped file is appropriately saved to the disk.

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk

now here's some RCurl code that does not work. as stated in the previous question, reproducing this exactly will require creating an extract on ipums.

your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"

library(RCurl)

curl = getCurlHandle()

curlSetOpt(
    cookiejar = 'cookies.txt', 
    followlocation = TRUE, 
    autoreferer = TRUE, 
    ssl.verifypeer = FALSE,
    curl = curl
)

params <- 
    list(
        "login[email]" = your.email , 
        "login[password]" = your.password , 
        "login[is_for_login]" = 1
    )

html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)

and now that i'm logged in, try the same commands as above, but with the curl object to keep the cookies.

filename <- tempfile()
f <- CFILE(filename, mode = "wb")

this line breaks:

curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)

# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) : 
  embedded nul in string: [[binary jibberish here]]

the answer to my previous post referred me to this c-level writefunction answer, but i'm clueless about how to re-create that curl_writer C program (on windows?)..

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

..or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. i just don't understand why passing in that extra curl object that stores the authentication/cookies and tells it not to verify SSL would cause code that otherwise works.. to break?

Recommended answer

this is now possible with the httr package. thanks hadley!

https://github.com/hadley/httr/issues/44
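
for anyone landing here later, a minimal sketch of what the httr route might look like. the helper name `download_after_login` and its arguments are hypothetical, not from the linked issue. it assumes the login form fields shown in the question are still what the server expects, and that your httr version provides `write_disk()`, which streams the response body to a file instead of holding it in RAM.

```r
library(httr)

# hypothetical helper (not part of httr or the original post):
# log in with a form POST, then stream a file straight to disk.
download_after_login <- function(login_url, file_url, email, password,
                                 dest = tempfile()) {
  # httr reuses one handle per host, so cookies set by the login
  # response are sent again on later requests to the same host.
  login <- POST(
    login_url,
    body = list(
      "login[email]" = email,
      "login[password]" = password,
      "login[is_for_login]" = "1"
    ),
    encode = "form"
  )
  stop_for_status(login)

  # write_disk() writes the body to `dest` as it arrives,
  # so the file never has to fit in memory.
  resp <- GET(file_url, write_disk(dest, overwrite = TRUE))
  stop_for_status(resp)

  dest
}
```

with the variables from the question, the call would be something like `download_after_login("https://usa.ipums.org/usa-action/users/validate_login", extract.path, your.email, your.password)`. note that `stop_for_status()` only catches HTTP error codes; if the site answers a failed login with a normal 200 page, the saved file will be that HTML page rather than the extract, so it is worth checking the file size afterwards.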
