Programmatically scraping a response header within R


Problem description


I am trying to read the Location response header (highlighted in the screenshot) using only R and its curl-based web-scraping libraries. You can reach this point in any web browser by visiting http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp, clicking the download link for any of the data files, and filling out the agreement form. The download then begins automatically in the browser.

I believe that the only way to obtain a valid cookie is with library(curlconverter) (see How to download a file behind a semi-broken javascript asp function with R), but that answer does not appear to be enough to programmatically determine the HTTP URL of the file, only to download the zipped file once the URL is already known.

I've pasted some code below with the different httr and curlconverter calls I've experimented with, but I'm missing something. Again, the only goal is to programmatically determine the highlighted text entirely within R (cross-platform).

library(curlconverter)
library(httr)

browserPOST <-
    "curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
    -H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    -H 'Accept-Encoding:gzip, deflate'
    -H 'Accept-Language:en-US,en;q=0.8'
    -H 'Cache-Control:max-age=0'
    --compressed -H 'Connection:keep-alive'
    -H 'Content-Length:188'
    -H 'Content-Type:application/x-www-form-urlencoded'
    -H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
    -H 'Host:www.worldvaluessurvey.org'
    -H 'Origin:http://www.worldvaluessurvey.org'
    -H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
    -H 'Upgrade-Insecure-Requests:1'
    -H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"

form_data <-
    list( 
        ulthost = "WVS" ,
        CMSID = "" ,
        LITITLE = "" ,
        LINOMBRE = "fas" ,
        LIEMPRESA = "asf" ,
        LIEMAIL = "asdf" ,
        LIPROJECT = "asfd" ,
        LIUSE = "1" ,
        LIPURPOSE = "asdf" ,
        LIAGREE = "1" ,
        DOID = "3996" ,
        CndWAVE = "-1" ,
        SAID = "-1" ,
        AJArchive = "WVS Data Archive" ,
        EdFunction = "" ,
        DOP = "" 
    )   



getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()

a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
    httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
        `Cache-Control` = "max-age=0", Connection = "keep-alive", 
        `Content-Length` = "188", Host = "www.worldvaluessurvey.org", 
        Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
        `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"), 
    httr::set_cookies(`Cookie:ASPSESSIONIDCASQAACD` = "IBLGBFOAEHFILMMJJCFEOEMI", 
        JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45", 
        `__atuvs` = "58224f37d312c42400c"), encode = "form",body=form_data)

Solution

This was a nice challenge!

The problem is not specific to R. We would get the same result in any language if we simply posted some data to the download script. We have to deal with a kind of security "pattern" here: the site restricts users from retrieving the file URLs and asks them to fill in forms before it provides those links. If a browser can retrieve these links, then so can we, by making the proper HTTP calls. The catch is that we need to know exactly which calls to make, and to find that out we need to look at the individual calls the site makes whenever someone clicks to download. The relevant ones happen a few calls before the successful POST to AJDownload.jsp that returns 302.

If we look at the AJDocumentation.jsp source, we can see that it makes these calls using jQuery $.get:

$.get("http://ipinfo.io?token=xxxxxxxxxxxxxx", function (response) {
    var geodatos=encodeURIComponent(response.ip+"\t"+response.country+"\t"+response.postal+"\t"+
    response.loc+"\t"+response.region+"\t"+response.city+"\t"+
    response.org);

    $.get("jdsStatJD.jsp?ID="+geodatos+
        "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
        function (resp2) {
    });
}, "jsonp");

Then, a few calls below, we can see the successful POST /AJDownload.jsp with status 302 Moved Temporarily and with the wanted Location in its response headers:

HTTP/1.1 302 Moved Temporarily
Content-Length: 0
Content-Type: text/html
Location: http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 01 Dec 2016 16:24:37 GMT

So, this is the security mechanism of this site. It uses ipinfo.io to look up information about the visitor's IP, location and even ISP organization, and records it just before the user initiates a download by clicking on a link. The script that receives this data is /jdsStatJD.jsp. I did not use ipinfo.io or the site's API key for that service (I hid it in my screenshots); instead I created a dummy but valid sequence of data, just to validate the request. The form data for the "protected" files is not required at all: it is possible to download the files without posting it.
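As a rough illustration of how such a dummy ID value can be assembled, the sketch below mirrors what the page's jQuery code does: join the fields with tabs, then percent-encode the result. The helper name build_stat_url is hypothetical, and base R's utils::URLencode(reserved = TRUE) is assumed to be a close enough stand-in for JavaScript's encodeURIComponent for this particular input.

```r
# Hypothetical helper: rebuild the jdsStatJD.jsp URL the page's jQuery code requests.
build_stat_url <- function(ip, country, postal, loc, region, city, org) {
  # The page joins the ipinfo.io fields with tab characters before encoding them
  geodatos <- paste(ip, country, postal, loc, region, city, org, sep = "\t")
  # reserved = TRUE percent-encodes tabs, commas and spaces as well
  id <- utils::URLencode(geodatos, reserved = TRUE)
  paste0(
    "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=", id,
    "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp",
    "&referer=null&cms=Documentation"
  )
}

# Same dummy values as in the working script
stat_url <- build_stat_url(
  ip = "2.72.48.149", country = "IT", postal = "undefined",
  loc = "41.8902,12.4923", region = "Lazio", city = "Roma",
  org = "Orange SA Telecommunications Corporation"
)
```

With these inputs the result should match the hard-coded jdsStatJD.jsp URL used in the script, so the helper is only a convenience if you want to vary the dummy values.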

Also, the curlconverter library is not required. All we need are simple GET and POST requests made with the httr library. One important detail: by default, httr's POST function follows the Location header received with the 302 status on our last call. To prevent that, we pass config(followlocation = FALSE), which stops httr from following the redirect and lets us read the Location value from the response headers instead.
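Once redirects are no longer followed, it can be worth checking that the response really is a redirect before trusting its Location header. The following is a minimal sketch of that check; the helper name redirect_location is hypothetical, and the set of status codes treated as redirects is an assumption rather than something the site documents.

```r
# Hypothetical helper: return the redirect target from a status code and a
# header list, failing loudly when the response is not a redirect.
redirect_location <- function(status, hdrs) {
  if (!status %in% c(301L, 302L, 303L, 307L, 308L)) {
    stop("expected a redirect, got HTTP status ", status)
  }
  # Header names are case-insensitive, so normalise them before the lookup
  names(hdrs) <- tolower(names(hdrs))
  location <- hdrs[["location"]]
  if (is.null(location)) stop("redirect response has no Location header")
  location
}

# Usage with httr (not run here, it needs network access):
# resp <- httr::POST(url, httr::config(followlocation = FALSE),
#                    body = post_data, encode = "form")
# redirect_location(httr::status_code(resp), httr::headers(resp))

# Offline example using the headers from the 302 response shown earlier
target <- redirect_location(302L, list(
  `Content-Length` = "0",
  Location = "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"
))
```

Keeping the check separate from the HTTP call means it can be exercised without contacting the site.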

OUTPUT

My R script can be run from the command line and accepts a numeric DOID as its argument to identify the file you want. For example, to get the link for the file WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18, pass its DOID (which is 3724) as the last argument when calling the script with the Rscript command:

Rscript wvs_fetch_downloads.r 3724
[1] "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"

I have created an R function to get each file location you want by just passing the DOID:

getFileById <- function(fileId)

You can remove the command line argument parsing and use the function by passing the DOID directly:

#args <- commandArgs(TRUE)
#if(length(args) == 0) {
#   print("No file id specified. Use './script.r ####'.")
#   quit("no")
#}

#fileId <- args[1]
fileId <- "3724"

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)
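If you keep the command-line entry point, a slightly stricter argument check may help catch typos before any request is sent. This is only a sketch: parse_doid is a hypothetical helper, and it assumes every DOID is a plain string of digits, as in the list above.

```r
# Hypothetical helper: accept exactly one all-digit DOID from the command line.
parse_doid <- function(args) {
  if (length(args) != 1 || !grepl("^[0-9]+$", args[1])) {
    stop("Usage: Rscript wvs_fetch_downloads.r <DOID>, e.g. 3724")
  }
  # Returned as a character string, which is what the form body expects
  args[1]
}

# In the script this would be parse_doid(commandArgs(TRUE))
file_id <- parse_doid("3724")
```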

Final R working script

library(httr)

getFileById <- function(fileId) {
    response <- GET(
        url = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1", 
        add_headers(
            `Accept` = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
            `Upgrade-Insecure-Requests` = "1"))

    set_cookie <- headers(response)$`set-cookie`
    cookies <- strsplit(set_cookie, ';')
    cookie <- cookies[[1]][1]

    response <- GET(
        url = "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=2.72.48.149%09IT%09undefined%0941.8902%2C12.4923%09Lazio%09Roma%09Orange%20SA%20Telecommunications%20Corporation&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation", 
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `X-Requested-With` = "XMLHttpRequest",
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie))

    post_data <- list( 
        ulthost = "WVS",
        CMSID = "",
        CndWAVE = "-1",
        SAID = "-1",
        DOID = fileId,
        AJArchive = "WVS Data Archive",
        EdFunction = "",
        DOP = "",
        PUB = "")  

    response <- POST(
        url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
        config(followlocation = FALSE),
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive",
            `Host` = "www.worldvaluessurvey.org",
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie),
        body = post_data,
        encode = "form")

    location <- headers(response)$location
    location
}

args <- commandArgs(TRUE)
if(length(args) == 0) {
    print("No file id specified. Use './script.r ####'.")
    quit("no")
}

fileId <- args[1]

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)
