Using R to "click" a download file button on a webpage
Question
I am attempting to use this webpage http://volcano.si.edu/search_eruption.cfm to scrape data. There are two drop-down boxes that ask for filters of the data. I do not need filtered data, so I leave those blank and continue on to the next page by clicking "Search Eruptions".
What I have noticed, though, is that the resulting table includes only 5 of the 24 columns it should have. However, all 24 columns are there if you click the "Download Results to Excel" button and open the downloaded file. This is what I need.
So, it looks like this has turned from a scraping exercise (using httr and rvest) into something more difficult. However, I'm stumped on how to actually "click" on the "Download Results to Excel" button using R. My guess is I will have to use RSelenium, but here is my code trying to use httr with POST in case there is an easier way that any of you kind people can find. I've also tried using gdata, data.table, XML, etc. to no avail which could just be a result of user error.
Also, it might be helpful to know that the download button cannot be right-clicked to show a URL.
url <- "http://volcano.si.edu/search_eruption_results.cfm"
searchcriteria <- list(
eruption_category = "",
country = ""
)
mydata <- POST(url, body = "searchcriteria")
Using the Inspector in my browser, I was able to see that the two filters are "eruption_category" and "country" and both will be blank since I do not need any filtered data.
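As an aside, form field names can also be read programmatically with `rvest::html_form()` rather than the browser Inspector. The sketch below runs on a small inline HTML snippet; the markup and the `[]`-suffixed names are assumptions for illustration, not a copy of the real search page.

```r
library(rvest)

# Hypothetical markup standing in for the search form; the real page may differ.
form_html <- '
<form action="search_eruption_results.cfm" method="post">
  <select name="eruption_category[]"><option value="">All</option></select>
  <select name="country[]"><option value="">All</option></select>
</form>'

# html_form() returns one entry per <form> element in the document
forms <- html_form(read_html(form_html))

# Collect the name of every field in the first form
field_names <- unname(sapply(forms[[1]]$fields, function(f) f$name))
print(field_names)
```

Against the live site, the same call could be pointed at `read_html("http://volcano.si.edu/search_eruption.cfm")`, assuming the page is reachable and its form is present in the static HTML.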
Lastly, it would seem that the above code will get me on to the page that has the table with only 5 columns. However, I was still unable to scrape this table using rvest in the code below (using SelectorGadget to scrape just one column). In the end, this part doesn't matter as much because, as I had said above, I need all 24 columns, not just these 5. But, if you find any errors with what I did below as well, I would be grateful.
Eruptions <- mydata %>%
read_html() %>%
html_nodes(".td8") %>%
html_text()
Eruptions
Thank you for any help you can provide.
Recommended answer
Just mimic the POST it does:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
POST("http://volcano.si.edu/search_eruption_results.cfm",
body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "", cp = "1"),
encode = "form") -> res
content(res, as="parsed") %>%
html_nodes("div.DivTableSearch") %>%
html_nodes("div.tr") %>%
map(html_children) %>%
map(html_text) %>%
map(as.list) %>%
map_df(setNames, c("volcano_name", "subregion", "eruption_type",
"start_date", "max_vei", "X1")) %>%
select(-X1)
## # A tibble: 750 × 5
## volcano_name subregion eruption_type start_date
## <chr> <chr> <chr> <chr>
## 1 Chirinkotan Kuril Islands Confirmed Eruption 2016 Nov 29
## 2 Zhupanovsky Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3 Kerinci Sumatra Confirmed Eruption 2016 Nov 15
## 4 Langila New Britain Confirmed Eruption 2016 Nov 3
## 5 Cleveland Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6 Ebeko Kuril Islands Confirmed Eruption 2016 Oct 20
## 7 Ulawun New Britain Confirmed Eruption 2016 Oct 11
## 8 Karymsky Kamchatka Peninsula Confirmed Eruption 2016 Oct 5
## 9 Ubinas Peru Confirmed Eruption 2016 Oct 2
## 10 Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>
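The pipeline above works because the results are laid out as nested `div` elements rather than a real `<table>`, so each `div.tr` is treated as a row and its children as cells. A minimal offline sketch of the same technique follows; the inline HTML and the two column names are assumptions chosen for illustration.

```r
library(rvest)
library(purrr)

# Hypothetical markup imitating a div-based "table"; the real page may differ.
html <- '
<div class="DivTableSearch">
  <div class="tr"><div class="td8">Chirinkotan</div><div class="td8">Kuril Islands</div></div>
  <div class="tr"><div class="td8">Kerinci</div><div class="td8">Sumatra</div></div>
</div>'

rows <- read_html(html) %>%
  html_nodes("div.DivTableSearch div.tr") %>%   # one node per row
  map(html_children) %>%                        # the cells of each row
  map(html_text) %>%                            # cell text as a character vector
  map(as.list) %>%                              # one list element per cell
  map_df(setNames, c("volcano_name", "subregion"))  # name the cells, bind the rows

print(rows)
```

Naming each row's cells before binding is what lets `map_df()` stack the rows into columns. Note that `map_df()` needs dplyr installed to do the row binding.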
I assumed the "Excel" part could be inferred, but if not:
POST("http://volcano.si.edu/search_eruption_excel.cfm",
body = list(`eruption_category[]` = "",
`country[]` = ""),
encode = "form",
write_disk("eruptions.xls")) -> res
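Before reading `eruptions.xls`, it may be worth checking what the server actually sent, since some sites serve an HTML table under an `.xls` name. The helper below is a hypothetical sketch (`sniff_format` is not part of any package) that inspects the first bytes of a file: legacy Excel workbooks start with `D0 CF 11 E0`, and zip-based `.xlsx` files start with `PK`. It is demonstrated on temporary files, not on the real download.

```r
# Hypothetical helper: guess what a downloaded ".xls" file really contains
# by looking at its first bytes.
sniff_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 4)
  if (length(magic) == 4 &&
      identical(magic, as.raw(c(0xD0, 0xCF, 0x11, 0xE0)))) {
    "xls"    # OLE2 compound file, i.e. a legacy Excel workbook
  } else if (length(magic) >= 2 &&
             identical(magic[1:2], as.raw(c(0x50, 0x4B)))) {
    "xlsx"   # zip container ("PK")
  } else {
    "text"   # probably HTML or delimited text
  }
}

# Temporary files standing in for a download
html_file <- tempfile(fileext = ".xls")
writeLines("<table><tr><td>Chirinkotan</td></tr></table>", html_file)

ole_file <- tempfile(fileext = ".xls")
writeBin(as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)), ole_file)

fmt_html <- sniff_format(html_file)
fmt_ole  <- sniff_format(ole_file)
```

If the result is `"xls"` or `"xlsx"`, `readxl::read_excel("eruptions.xls")` is one option for loading it. If it is `"text"`, parsing the file with `rvest::read_html()` and `html_table()` may work instead. Which case applies to this site is not confirmed here.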