Scraping from an aspx website using R
Question
I am trying to use R to scrape data from a website.
I would like to go through each of the links on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
Select only the items whose Current Status shows "transmitted to the governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
Then scrape the cells under STATUS TEXT for the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).
I have tried adapting earlier examples that use the RCurl and XML packages in R, but I don't know how to use them correctly for aspx sites. What I would love to have is: 1. Some suggestions on how to build such code. 2. Recommendations on how to learn what is needed to perform such a task.
Thanks for your help,
Tom
Recommended answer
library(httr)
library(XML)

basePage <- "http://capitol.hawaii.gov"
# a handle reuses one connection (and its cookies) for every request to the site
h <- handle(basePage)
GET(handle = h)
res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
# the third column of the report table holds the current status
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
# the second column holds the link to each bill's page
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')
appUrls <- resUrls[include]

# look at just the first
res <- GET(handle = h, path = appUrls[1])
resXML <- htmlParse(content(res, as = "text"))
xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
Tokioka voting no (4) and none excused (0)."
Let the httr package handle all the background work by setting up a handle.
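The row-filtering step can be tried without touching the live site. The following is a minimal sketch, assuming a small hand-written table that only imitates the GridViewReports layout (status in the third column, link in the second); the HTML snippet and the names `html`, `doc`, `status`, `links`, `keep` and `selected` are illustrative and not taken from the real page.

```r
library(XML)

# Hypothetical miniature of the report table; the real page's markup may differ.
html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=17&amp;year=2013">HB17</a></td><td>Transmitted to Governor.</td></tr>
  <tr><td>2</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=18&amp;year=2013">HB18</a></td><td>Referred to committee.</td></tr>
</table>
</body></html>'

# asText = TRUE tells htmlParse that the argument is markup, not a file name.
doc <- htmlParse(html, asText = TRUE)

# Status text from the third cell of each row, link from the second cell.
status <- xpathSApply(doc, '//table[@id="GridViewReports"]//tr/td[3]', xmlValue)
links  <- xpathSApply(doc, '//table[@id="GridViewReports"]//tr/td[2]//@href')

# Keep only the links whose row mentions the wanted status.
keep <- grepl("Transmitted to Governor", status, fixed = TRUE)
selected <- as.character(links[keep])
print(selected)
```

Using `//tr` rather than `/tr` in the sketch is a defensive choice: it still matches if the table rows happen to be wrapped in a `tbody` element.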
If you want to run over all 92 links:
# get all the links returned as a list (will take some time)
# print statement included for sanity
res <- lapply(appUrls, function(x) {
  print(sprintf("Got url no. %d", which(appUrls %in% x)))
  GET(handle = h, path = x)
})
resXML <- lapply(res, function(x) htmlParse(content(x, as = "text")))
appString <- sapply(resXML, function(x) {
  xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
})
head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
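Because some bills yield one matching sentence and others two, the result is a list rather than a vector. One possible way to flatten it into a table with one row per sentence is sketched below using base R only. The names `urls`, `statuses` and `long` are hypothetical stand-ins for `appUrls` and `appString`, the URL labels are placeholders rather than real paths, and the status strings are copied from the output above.

```r
# Hypothetical stand-ins: placeholder labels instead of the real bill URLs.
urls <- c("bill-A", "bill-B")

# One element per bill; each element holds every matching sentence for that bill.
statuses <- list(
  "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).",
  c("Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none.",
    "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0).")
)

# Repeat each URL once per sentence so the two columns line up.
long <- data.frame(
  url    = rep(urls, lengths(statuses)),
  status = unlist(statuses, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long$url)
```

With the real objects, the same pattern would take `appUrls` in place of `urls` and `appString` in place of `statuses`, provided `appString` is still a list with one element per URL.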