Scraping dynamic table in R with POST


Question

I'm trying to scrape this table using R. So far, using the code below, I have only managed to get 27 rows of it. I would like to retrieve all the entries and, ideally, modify the request so that I can select certain years, etc. Other questions on SO target slightly different situations, and I would like to stay within the rvest/xml2/httr world if possible.

library(magrittr)  # provides the %>% pipe used below

url <- "http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/"


view <- httr::POST(url) %>% 
  xml2::read_html() %>% 
  rvest::html_nodes("input[name='__VIEWSTATE']") %>% 
  rvest::html_attr("value")

param <- list(`__EVENTTARGET` =     "",
               `__EVENTARGUMENT` =  "",
               `__VIEWSTATE` = view,
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$RefreshButton` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName` =   "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox` = "10000",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState` = "",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState` = "",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState` =    "",
               `__VIEWSTATEGENERATOR` = "CA0B0334")

request <- httr::POST(url,
                       body = param,
                       encode = 'form') %>% 
  xml2::read_html() %>% 
  rvest::html_table(fill = TRUE)

tib <- request[[1]]

> dim(tib)
[1] 27  9

Recommended answer

The table in question has an "Export to CSV" link.

If you click on it, you get the 6.36 MB CSV file directly, which is good. I'm assuming that you need or want to do this programmatically, so here is what worked for me:

  1. I'm using Firefox, but Chrome has a similar capability: the Inspector. I opened it (Ctrl-Shift-I) and went to the "Network" tab.
  2. Click on the "Export to CSV" button. You should see a new "POST" line in the inspector frame. When it's complete ...
  3. Right-click on the "POST" line and select "Copy POST Data"; this provides:

__EVENTTARGET
__EVENTARGUMENT
__VIEWSTATE=...
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton=+
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox=20
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState
__VIEWSTATEGENERATOR=CA0B0334

(I replaced the long base64 string with "...".) The notable line is the fourth, ending in $ExportToCsvButton=+. This is the parameter you need to include in your POST data (param).
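If you would rather not retype the field names by hand, the copied POST data can be turned into a named list with a few lines of base R. This is only a sketch: the sample lines are a shortened, made-up stand-in for the real copied text (the abbreviated button name is hypothetical), and `param_sketch` is an illustrative name, not something from the original code.

```r
# Shortened stand-in for the text produced by "Copy POST Data".
# The abbreviated control name below is hypothetical.
post_data <- c(
  "__EVENTTARGET",
  "__EVENTARGUMENT",
  "__VIEWSTATE=...",
  "ctl00$RadGrid1$ExportToCsvButton=+",
  "__VIEWSTATEGENERATOR=CA0B0334"
)

# Split each line at its first "=" only; lines without "=" get an empty value.
keys   <- sub("=.*$", "", post_data)
values <- ifelse(grepl("=", post_data, fixed = TRUE),
                 sub("^[^=]*=", "", post_data),
                 "")

# A named list in the shape httr::POST(body = ..., encode = "form") accepts.
param_sketch <- setNames(as.list(values), keys)
str(param_sketch)
```

Splitting at the first "=" matters because base64 values such as __VIEWSTATE can themselves end in "=" padding.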

Using the question's code up through and including the definition of param, continue with:

param$`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton` <- "+"
request <- httr::POST(url, body = param, encode = 'form')

You will now have:

request
# Response [http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/]
#   Date: 2017-06-01 18:09
#   Status: 200
#   Content-Type: text/csv; charset-UTF-8;
#   Size: 6.36 MB
# <U+FEFF>"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"
# "2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""
# "2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""
# "2000","101","LAKE PIERCE","8 ft. 10 in.","10-06-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","09-25-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","10-07-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","09-22-2000",""
# "2000","101","LAKE PIERCE","7 ft. 2 in.","09-21-2000",""
# "2000","101","LAKE PIERCE","7 ft. 1 in.","09-21-2000",""
# "2000","101","LAKE PIERCE","6 ft. 11 in.","09-25-2000",""
# ...

Side note: the website starts the file with <U+FEFF>, the Unicode byte-order mark (BOM). This throws off read.csv and gives you a first column name of X.U.FEFF.Year, but the effect is entirely cosmetic.
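If the mangled column name bothers you, one option is to strip the leading U+FEFF from the text before parsing it. The sketch below uses a tiny made-up string in place of the real download; in practice `csv_text` would be `as.character(request)`.

```r
# Made-up two-line CSV with a leading byte-order mark, standing in for
# as.character(request).
csv_text <- paste0("\uFEFF", '"Year","Area Number"\n"2000","101"\n')

# Drop the BOM only if it is the very first character.
clean_text <- sub("^\uFEFF", "", csv_text)

df <- read.csv(text = clean_text)
names(df)
```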

If you don't care about the suggested filename, you can simply do:

write(as.character(request), file="quux.csv")
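An alternative worth considering is to write the raw response bytes instead of the decoded text, so the file on disk is exactly what the server sent. The sketch below assumes the bytes would come from `httr::content(request, "raw")`; here a short hand-built raw vector stands in for them, and the file goes to a temporary directory.

```r
# Stand-in for httr::content(request, "raw"): a UTF-8 BOM followed by a
# made-up two-line CSV.
body_raw <- c(as.raw(c(0xEF, 0xBB, 0xBF)),
              charToRaw('"Year","Area Number"\n"2000","101"\n'))

out_file <- file.path(tempdir(), "quux.csv")
writeBin(body_raw, out_file)

# Reading the file back with fileEncoding = "UTF-8-BOM" drops the BOM,
# so the first column name comes out clean.
saved <- read.csv(out_file, fileEncoding = "UTF-8-BOM")
names(saved)
```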

If you want to use the filename the website suggests, you can find it with:

httr::headers(request)$`content-disposition`
# [1] "inline;filename=\"FWCAlligatorHarvestData.csv\""

Parsing that should be straightforward.
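For instance, a single regular expression is enough to pull the name out of the header. This sketch hard-codes the header value shown above so it runs without a network call; with a live response it would come from `httr::headers(request)`.

```r
# Header value copied from the output above.
disposition <- "inline;filename=\"FWCAlligatorHarvestData.csv\""

# Capture whatever follows `filename=`, with or without surrounding quotes.
filename <- sub('.*filename="?([^";]+)"?.*', "\\1", disposition)
filename
```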

If you don't want or need to save to an intermediate file, you can always consume it immediately:

head(read.csv(textConnection(as.character(request))))
# Invalid encoding : defaulting to UTF-8.
#   X.U.FEFF.Year Area.Number   Area.Name Carcass.Size Harvest.Date Location
# 1          2000         101 LAKE PIERCE 11 ft. 5 in.   09-22-2000         
# 2          2000         101 LAKE PIERCE  9 ft. 0 in.   10-02-2000         
# 3          2000         101 LAKE PIERCE 8 ft. 10 in.   10-06-2000         
# 4          2000         101 LAKE PIERCE  8 ft. 0 in.   09-25-2000         
# 5          2000         101 LAKE PIERCE  8 ft. 0 in.   10-07-2000         
# 6          2000         101 LAKE PIERCE  8 ft. 0 in.   09-22-2000         
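Once the data is in a data frame, selecting certain years or dates can be done locally rather than by modifying the request. The sketch below is one way to do that, using three rows copied from the output above in place of the full download; the names `harvest`, `october` and `Length.in` are illustrative.

```r
# Three rows copied from the response shown earlier, as a small stand-in.
csv_lines <- c(
  '"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"',
  '"2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""',
  '"2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""',
  '"2000","101","LAKE PIERCE","8 ft. 10 in.","10-06-2000",""'
)
harvest <- read.csv(text = paste(csv_lines, collapse = "\n"),
                    stringsAsFactors = FALSE)

# Parse the month-day-year dates so they can be compared.
harvest$Harvest.Date <- as.Date(harvest$Harvest.Date, format = "%m-%d-%Y")

# Convert "11 ft. 5 in." style sizes to total inches.
size_pattern <- "^([0-9]+) ft\\. ([0-9]+) in\\.$"
feet   <- as.numeric(sub(size_pattern, "\\1", harvest$Carcass.Size))
inches <- as.numeric(sub(size_pattern, "\\2", harvest$Carcass.Size))
harvest$Length.in <- feet * 12 + inches

# Keep only harvests on or after 1 October 2000.
october <- subset(harvest, Harvest.Date >= as.Date("2000-10-01"))
october
```

The same `subset()` call works for years, for example `subset(harvest, Year %in% c(2000, 2001))`.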
