Using R to "click" a download file button on a webpage

Question

I am attempting to use this webpage http://volcano.si.edu/search_eruption.cfm to scrape data. There are two drop-down boxes that ask for filters of the data. I do not need filtered data, so I leave those blank and continue on to the next page by clicking "Search Eruptions".

What I have noticed, though, is that the resulting table includes only 5 columns, compared with the 24 it should have. However, all 24 columns are there if you click the "Download Results to Excel" button and open the downloaded file. This is what I need.

So, it looks like this has turned from a scraping exercise (using httr and rvest) into something more difficult. However, I'm stumped on how to actually "click" on the "Download Results to Excel" button using R. My guess is I will have to use RSelenium, but here is my code trying to use httr with POST in case there is an easier way that any of you kind people can find. I've also tried using gdata, data.table, XML, etc. to no avail which could just be a result of user error.

Also, it might be helpful to know that the download button cannot be right-clicked to show a URL.

url <- "http://volcano.si.edu/search_eruption_results.cfm"

searchcriteria <- list(
    eruption_category = "",
    country = ""
)

mydata <- POST(url, body = "searchcriteria")

Using the Inspector in my browser, I was able to see that the two filters are "eruption_category" and "country" and both will be blank since I do not need any filtered data.

Lastly, it would seem that the above code will get me on to the page that has the table with only 5 columns. However, I was still unable to scrape this table using rvest in the code below (using SelectorGadget to scrape just one column). In the end, this part doesn't matter as much because, as I had said above, I need all 24 columns, not just these 5. But, if you find any errors with what I did below as well, I would be grateful.

Eruptions <- mydata %>%
    read_html() %>%
    html_nodes(".td8") %>%
    html_text()
Eruptions

Thank you for any help you can provide.

Recommended answer

Just mimic the POST it does:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

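# Replay the form submission; note the trailing "[]" on two of the field names.
res <- POST("http://volcano.si.edu/search_eruption_results.cfm",
            body = list(bp = "", `eruption_category[]` = "", `country[]` = "",
                        polygon = "", cp = "1"),
            encode = "form")

# Each result row is a div.tr inside div.DivTableSearch: take the text of
# its child cells and bind the rows into one data frame.
content(res, as = "parsed") %>%
  html_nodes("div.DivTableSearch") %>%
  html_nodes("div.tr") %>%
  map(html_children) %>%
  map(html_text) %>%
  map(as.list) %>%
  map_df(setNames, c("volcano_name", "subregion", "eruption_type",
                     "start_date", "max_vei", "X1")) %>%
  select(-X1)  # drop the unneeded sixth cell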
## # A tibble: 750 × 5
##    volcano_name            subregion      eruption_type  start_date
##           <chr>                <chr>              <chr>       <chr>
## 1   Chirinkotan        Kuril Islands Confirmed Eruption 2016 Nov 29
## 2   Zhupanovsky  Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3       Kerinci              Sumatra Confirmed Eruption 2016 Nov 15
## 4       Langila          New Britain Confirmed Eruption  2016 Nov 3
## 5     Cleveland     Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6         Ebeko        Kuril Islands Confirmed Eruption 2016 Oct 20
## 7        Ulawun          New Britain Confirmed Eruption 2016 Oct 11
## 8      Karymsky  Kamchatka Peninsula Confirmed Eruption  2016 Oct 5
## 9        Ubinas                 Peru Confirmed Eruption  2016 Oct 2
## 10      Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>
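
For context on why this request differs from the attempt in the question: there the body was the literal string "searchcriteria", whereas here it is a named list sent with encode = "form", and the field names carry a trailing "[]". The base-R sketch below shows roughly what such a form-encoded body looks like. `form_encode` is a hypothetical helper written for illustration, not part of httr, and the exact bytes httr sends may differ in detail.

```r
# Hypothetical helper (not part of httr): build an
# application/x-www-form-urlencoded string from a named list.
form_encode <- function(fields) {
  # Percent-encode names and values, including reserved characters like "[" and "]".
  keys <- vapply(names(fields), utils::URLencode, character(1),
                 reserved = TRUE, USE.NAMES = FALSE)
  vals <- vapply(fields,
                 function(v) utils::URLencode(as.character(v), reserved = TRUE),
                 character(1), USE.NAMES = FALSE)
  # Join each pair as name=value and separate the pairs with "&".
  paste(paste0(keys, "=", vals), collapse = "&")
}

# The two filter fields from the question, left blank.
fields <- list(`eruption_category[]` = "", `country[]` = "")
print(form_encode(fields))
```

Printing the encoded string makes it easy to compare against the form data shown in the browser's network inspector.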

I assumed the "Excel" part could be inferred, but if not:

POST("http://volcano.si.edu/search_eruption_excel.cfm", 
     body = list(`eruption_category[]` = "", 
                 `country[]` = ""), 
     encode = "form",
     write_disk("eruptions.xls")) -> res
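
Because a server can label HTML or XML output as an Excel file, it may be worth checking what was actually written to eruptions.xls before choosing a reader. The sketch below uses only base R and inspects the first bytes of a file. `sniff_format` is a hypothetical helper, and which format this particular site returns is not something the answer states.

```r
# Hypothetical helper: guess a file's container format from its leading bytes.
sniff_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 8)
  ole2 <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
  zip  <- as.raw(c(0x50, 0x4B, 0x03, 0x04))
  if (length(magic) >= 8 && identical(magic[1:8], ole2)) {
    "xls"    # legacy binary workbook (OLE2 compound file)
  } else if (length(magic) >= 4 && identical(magic[1:4], zip)) {
    "xlsx"   # zip-based workbook
  } else {
    "other"  # e.g. HTML, XML or CSV text saved under an .xls name
  }
}

# Usage, once the POST above has written the file:
if (file.exists("eruptions.xls")) print(sniff_format("eruptions.xls"))
```

If the result is "xls" or "xlsx", a package such as readxl should be able to open the file. For "other", opening it in a text editor will show whether it is an HTML table or XML that needs a different parser.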
