Download documents from aspx web page in R


Problem description

I'm trying to automatically download documents for Oil & Gas wells from the Colorado Oil and Gas Conservation Commission (COGCC) using the "rvest" and "downloader" packages in R.

The link to the table/form that contains the documents for a particular well is: http://ogccweblink.state.co.us/results.aspx?id=12337064

"id = 12337064"是孔的唯一标识符

The "id=12337064" is the unique identifier for the well

The documents on the form page can be downloaded by clicking them. An example is below: http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781

"DocumentID = 3172781"是要下载的文档的唯一文档ID.在这种情况下,xlsm文件.文档页面上的其他文件格式包括PDF和xls.

The "DocumentID=3172781" is the unique document ID for the document to be downloaded. In this case, an xlsm file. Other file formats on the document page include PDF and xls.

So far I've been able to write code to download any document for any well, but it is limited to the first page only. The majority of wells have documents on multiple pages, and I'm unable to download documents on pages other than page 1 (all document pages share the same URL).

## Extract the document ID for the document to be downloaded, in this case
## "DIRECTIONAL DATA". The CSS path was found with the SelectorGadget tool.
library(rvest)
html <- read_html("http://ogccweblink.state.co.us/results.aspx?id=12337064")
File <- html_nodes(html, "tr:nth-child(24) td:nth-child(4) a")
File <- as.character(File[[1]])      # serialize the <a> node to a string
DocId <- gsub('[^0-9]', '', File)    # keep only the digits, i.e. the DocumentId
DocId
[1] "3172781"

## To download the document, I use the downloader package
library(downloader)
linkDocId <- paste('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', DocId, sep = '')
download(linkDocId, "DIRECTIONAL DATA", mode = 'wb')

    trying URL 'http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781'
Content type 'application/octet-stream' length 33800 bytes (33 KB)
downloaded 33 KB

Does anyone know how I can modify my code to download documents on other pages?

Many thanks!

Em

Solution

You have to use the very same cookie for the second query and pass the viewstate and validation fields as well. Quick example:

  1. Load RCurl, fetch the URL, and preserve the cookie:

url   <- 'http://ogccweblink.state.co.us/results.aspx?id=12337064'
library(RCurl)
## reuse one handle with a cookie jar so the ASP.NET session survives between requests
curl  <- curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = getCurlHandle())
page1 <- getURL(url, curl = curl)

  2. Extract the VIEWSTATE and EVENTVALIDATION values after parsing the HTML:

    library(XML)
    page1 <- htmlTreeParse(page1, useInternal = TRUE)
    viewstate  <- xpathSApply(page1, '//input[@name = "__VIEWSTATE"]', xmlGetAttr, 'value')
    validation <- xpathSApply(page1, '//input[@name = "__EVENTVALIDATION"]', xmlGetAttr, 'value')
    

  3. Query the same URL again with the saved cookie and the extracted hidden INPUT values, and ask for the second page; ASP.NET grid pagers post back with __EVENTTARGET set to the grid control and an __EVENTARGUMENT of the form 'Page$N':

    page2 <- postForm(url, curl = curl,
             .params = list(
                 '__EVENTARGUMENT'   = 'Page$2',
                 '__EVENTTARGET'     = 'WQResultGridView',
                 '__VIEWSTATE'       = viewstate,
                 '__EVENTVALIDATION' = validation))
    

  4. Extract the URLs from the table shown on the second page:

    page2 <- htmlTreeParse(page2, useInternal = TRUE)
    xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
    
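
Putting the steps together, below is a rough sketch of how one might walk through the remaining result pages and download every document found. Note that ASP.NET rotates the hidden __VIEWSTATE/__EVENTVALIDATION fields on every postback, so they must be re-extracted from each response before requesting the next page. The n_pages value, the doc-id-based output filenames, and starting the loop at page 3 are assumptions for illustration, not part of the original answer:

    ## Sketch: collect document links from pages 2..n_pages, then download them.
    ## Assumes n_pages >= 3 and that `page2`, `url` and `curl` exist from the steps above.
    base_url <- 'http://ogccweblink.state.co.us/'
    n_pages  <- 3                                   # assumed; read it from the pager row in practice
    hrefs    <- xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
    for (i in 3:n_pages) {
        ## re-extract the hidden fields from the page we just received
        viewstate  <- xpathSApply(page2, '//input[@name = "__VIEWSTATE"]', xmlGetAttr, 'value')
        validation <- xpathSApply(page2, '//input[@name = "__EVENTVALIDATION"]', xmlGetAttr, 'value')
        page2 <- postForm(url, curl = curl,
                 .params = list(
                     '__EVENTARGUMENT'   = paste0('Page$', i),
                     '__EVENTTARGET'     = 'WQResultGridView',
                     '__VIEWSTATE'       = viewstate,
                     '__EVENTVALIDATION' = validation))
        page2 <- htmlTreeParse(page2, useInternal = TRUE)
        hrefs <- c(hrefs, xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href'))
    }
    for (h in hrefs) {
        doc_id <- sub('.*DocumentId=', '', h)       # e.g. "3172781"
        download.file(paste0(base_url, h), destfile = doc_id, mode = 'wb')
    }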
