Using R to "click" a download file button on a webpage

Question

I am attempting to use this webpage http://volcano.si.edu/search_eruption.cfm to scrape data. There are two drop-down boxes that ask for filters of the data. I do not need filtered data, so I leave those blank and continue on to the next page by clicking "Search Eruptions".

What I have noticed, though, is that the resulting table includes only 5 of the 24 columns it should have. However, all 24 columns are there if you click the "Download Results to Excel" button and open the downloaded file. This is what I need.

So, it looks like this has turned from a scraping exercise (using httr and rvest) into something more difficult. However, I'm stumped on how to actually "click" on the "Download Results to Excel" button using R. My guess is I will have to use RSelenium, but here is my code trying to use httr with POST in case there is an easier way that any of you kind people can find. I've also tried using gdata, data.table, XML, etc. to no avail which could just be a result of user error.

Also, it might be helpful to know that the download button cannot be right-clicked to show a URL.

url <- "http://volcano.si.edu/database/search_eruption_results.cfm"

searchcriteria <- list(
    eruption_category = "",
    country = ""
)

mydata <- POST(url, body = "searchcriteria")

Using the Inspector in my browser, I was able to see that the two filters are "eruption_category" and "country" and both will be blank since I do not need any filtered data.

Lastly, it would seem that the above code will get me on to the page that has the table with only 5 columns. However, I was still unable to scrape this table using rvest in the code below (using SelectorGadget to scrape just one column). In the end, this part doesn't matter as much because, as I had said above, I need all 24 columns, not just these 5. But, if you find any errors with what I did below as well, I would be grateful.

Eruptions <- mydata %>%
    read_html() %>%
    html_nodes(".td8") %>%
    html_text()
Eruptions

Thank you for any help you can provide.

Recommended answer

Just mimic the POST that the page makes:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

POST("http://volcano.si.edu/search_eruption_results.cfm",
     body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "",  cp = "1"),
     encode = "form") -> res

content(res, as="parsed") %>%
  html_nodes("div.DivTableSearch") %>%
  html_nodes("div.tr") %>%
  map(html_children) %>%
  map(html_text) %>%
  map(as.list) %>%
  map_df(setNames, c("volcano_name", "subregion", "eruption_type",
                     "start_date", "max_vei", "X1")) %>%
  select(-X1)
## # A tibble: 750 × 5
##    volcano_name            subregion      eruption_type  start_date
##           <chr>                <chr>              <chr>       <chr>
## 1   Chirinkotan        Kuril Islands Confirmed Eruption 2016 Nov 29
## 2   Zhupanovsky  Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3       Kerinci              Sumatra Confirmed Eruption 2016 Nov 15
## 4       Langila          New Britain Confirmed Eruption  2016 Nov 3
## 5     Cleveland     Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6         Ebeko        Kuril Islands Confirmed Eruption 2016 Oct 20
## 7        Ulawun          New Britain Confirmed Eruption 2016 Oct 11
## 8      Karymsky  Kamchatka Peninsula Confirmed Eruption  2016 Oct 5
## 9        Ubinas                 Peru Confirmed Eruption  2016 Oct 2
## 10      Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>
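
The row-by-row extraction above can be tried offline. Below is a minimal sketch, assuming the results page lays its "table" out as nested div elements (div.DivTableSearch containing div.tr rows). The HTML string is hand-written for illustration and is not the site's real markup, and the sketch swaps map_df for base R's rbind to keep dependencies small.

```r
library(rvest)
library(purrr)

# Hypothetical HTML imitating a div-based table; NOT copied from the real page.
sample_html <- '
<div class="DivTableSearch">
  <div class="tr"><div class="td8">Chirinkotan</div><div class="td8">Kuril Islands</div></div>
  <div class="tr"><div class="td8">Kerinci</div><div class="td8">Sumatra</div></div>
</div>'

# Select every row node, then collect the text of each row's child cells.
row_nodes <- read_html(sample_html) %>%
  html_nodes("div.DivTableSearch div.tr")
cells <- map(row_nodes, ~ html_text(html_children(.x)))

# Bind the per-row character vectors into a data frame and name the columns.
eruptions <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(eruptions) <- c("volcano_name", "subregion")
eruptions
```

Selecting whole rows and then their children keeps the cells of one eruption together, whereas selecting a single cell class such as ".td8" returns one flat vector of text.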

I assumed the "Excel" part could be inferred, but if not:

POST("http://volcano.si.edu/search_eruption_excel.cfm", 
     body = list(`eruption_category[]` = "", 
                 `country[]` = ""), 
     encode = "form",
     write_disk("eruptions.xls")) -> res
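
If the download needs to be repeated, the same request could be wrapped in a small helper. This is only a sketch: download_eruptions is a hypothetical name, not part of the original answer, and the endpoint and form fields are simply reused from the call above. It adds an HTTP status check and allows overwriting an existing file.

```r
library(httr)

# Hypothetical helper (an assumption, not from the original answer).
download_eruptions <- function(path = "eruptions.xls") {
  res <- POST("http://volcano.si.edu/search_eruption_excel.cfm",
              body = list(`eruption_category[]` = "",
                          `country[]` = ""),
              encode = "form",
              write_disk(path, overwrite = TRUE))  # replace an earlier download
  stop_for_status(res)                             # raise an error on 4xx/5xx
  message("Content-Type: ", http_type(res))        # report what the server sent
  invisible(path)
}

# Usage (needs network access):
# download_eruptions("eruptions.xls")
```

The reported content type hints at how to read the file afterwards. Whether the server sends a genuine Excel workbook or an HTML table saved with an .xls extension is not confirmed here, so check before picking a reader.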
