Web Scrape: Select Fields from Drop Downs, Extract Resulting Data


Problem Description

Trying to do some web scraping in R and could use some help.

I would like to extract the data in the table on this page: http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx

But I would like to first select County from the leftmost drop down, then select Alameda County (CA) from the next drop down, and then scrape the data in the table.

This is what I have so far, but I think I know why it's not working: rvest's form functions are suited to filling out a basic form, not to selecting from drop downs on an .aspx page(?). I searched around for examples of what I am trying to do but came up empty.

library(rvest)
url       <- "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx"
pgsession <- html_session(url)
pgform    <- html_form(pgsession)[[1]]

# attempt to set the two dropdown values via their CSS selectors
filled_form <- set_values(pgform,
                          `#atype_chosen span` = "County",
                          `#asel_chosen span`  = "Alameda County (CA)")
submit_form(pgsession, filled_form)

Anyway, this gives me an error: "Error: Unknown field names: #atype_chosen span, #asel_chosen span". I sort of get it... I am asking R to enter County into the box without opening the drop down, which isn't going to work.
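
For reference, you can list the field names the form actually exposes, using the same old rvest session objects as above; the chosen-widget spans are JavaScript decoration rather than form fields, so they won't appear:

# show the real field names on the first form -- the dropdown spans are absent
names(pgform$fields)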

If someone could point me in the right direction, I'd appreciate it.

Solution

I monitored the requests the browser made when I selected your county and used that info to create this. It gets you your data, just in a different way from how you went about it... The area parameter in the payload selects the county.

Update: I've added the code to get the county list and codes, so you can select whatever county you want to get the data from...

library("httr")

# start by getting the counties and their codes...
url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnAOI"
headers <- add_headers(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.8",
  "Content-Length" = "16",
  "Content-Type" = "application/json; charset=UTF-8",
  "Host" = "droughtmonitor.unl.edu",
  "Origin" = "http://droughtmonitor.unl.edu",
  "Proxy-Connection" = "keep-alive",
  "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
# POST the AOI request; the response is a JSON list of Text/Value pairs per county
a <- POST(url, body="{'aoi':'county'}", headers, encode="json")
tmp <- content(a)[[1]]
county_df <- data.frame(text=unname(unlist(sapply(tmp, "[", "Text"))),
                        value=unname(unlist(sapply(tmp, "[", "Value"))),
                        stringsAsFactors=FALSE)
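
To pick a county programmatically, look up its code in county_df. A minimal sketch; the grepl pattern below is an assumption, so inspect county_df$text for the exact label:

# find the payload code for Alameda County, CA
# (label text is assumed -- check county_df$text for the exact string)
alameda_code <- county_df$value[grepl("Alameda", county_df$text)]
alameda_code  # should match "06001", the code used in the payload below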

# use the code for whatever county you want in the payload below...

url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM"
payload <- "{'area':'06001', 'type':'county', 'statstype':'1'}"
headers <- add_headers(
                "Host" = "droughtmonitor.unl.edu",
                "Proxy-Connection" = "keep-alive",
                "Content-Length" = "50",
                "Accept" = "application/json, text/javascript, */*; q=0.01",
                "Origin" = "http://droughtmonitor.unl.edu",
                "X-Requested-With" = "XMLHttpRequest",
                "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
                "Content-Type" = "application/json; charset=UTF-8",
                "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
                "Accept-Encoding" = "gzip, deflate",
                "Accept-Language" = "en-US,en;q=0.8",
                "X-Requested-With" = "XMLHttpRequest"
)
# POST the tabular request and flatten the returned JSON into a data frame
a <- POST(url, body=payload, headers, encode="json")
tmp <- content(a)[[1]]
df <- data.frame(date=unname(unlist(sapply(tmp, "[", "Date"))),
                 d0=unname(unlist(sapply(tmp, "[", "D0"))),
                 d1=unname(unlist(sapply(tmp, "[", "D1"))),
                 d2=unname(unlist(sapply(tmp, "[", "D2"))),
                 d3=unname(unlist(sapply(tmp, "[", "D3"))),
                 d4=unname(unlist(sapply(tmp, "[", "D4"))),
                 stringsAsFactors=FALSE)
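
Everything comes back as character strings, so a final cleanup step helps. This sketch assumes the D0-D4 fields are plain numeric percentages; verify with str(df) against the actual response first:

# convert the drought-category columns to numeric
# (assumes values like "12.34"; check str(df) before relying on this)
num_cols <- c("d0", "d1", "d2", "d3", "d4")
df[num_cols] <- lapply(df[num_cols], as.numeric)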
