rvest Webscraping in R with form inputs

Question

I can't get my head around this problem in R, and I would really appreciate it if you could leave a piece of advice for me here.

I am trying to scrape historical bond yield data from https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data for personal use only (of course).

The solution provided here works really well but only scrapes the first 24 time stamps of daily data: "webscraping data tables and data from a web page"

What I am trying to achieve is to change the date range in order to scrape more historical data. Based on the SelectorGadget tool, the date range input has the id widgetFieldDateRange (XPath: //*[(@id = "widgetFieldDateRange")]).

I have also tried using the following lines of code to change the date values but without success:

library(rvest)
 
url1 <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data" #Spain 5yr yield

session <- html_session(url1)
pgform <- html_form(session)[[1]]

pgform$fields[[3]]$value <- "01/01/2010 - 09/10/2020"
result <- submit_form(session, pgform)

Question: Any idea how to correctly submit the new date range and retrieve the extended time series?

Thank you very much for your help!

PS: Unfortunately, the URL does not change based on the date range.

Recommended answer

You can perform the POST request directly:

POST https://www.investing.com/instruments/HistoricalDataAjax

You need to scrape a few pieces of information from the page that are required in the request:

  • the pair_ids attribute from a div tag
  • the header value from h2 tag inside .instrumentHeader class
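
As a rough illustration of those two lookups in isolation, the sketch below runs the same selectors against a small inline HTML fragment instead of the live site. The fragment, the id value 12345 and the heading text are made-up placeholders, not the real page markup.

```r
library(rvest)

# Hypothetical, heavily simplified stand-in for the live page markup
page <- read_html('
<html><body>
  <div class="instrumentHeader"><h2>Example Bond Historical Data</h2></div>
  <div id="results_box" pair_ids="12345"></div>
</body></html>')

# Value of the pair_ids attribute on any div that carries it
pair_ids <- page %>%
    html_nodes("div[pair_ids]") %>%
    html_attr("pair_ids")

# Text of the h2 inside the element with class instrumentHeader
header <- page %>%
    html_nodes(".instrumentHeader h2") %>%
    html_text()

print(pair_ids)
print(header)
```

Both calls return character vectors, which is why the full code picks the first header with header[[1]].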

Full code:

library(rvest)
library(httr)

# Date range to request
startDate <- as.Date("2020-06-01")
endDate <- Sys.Date() # today

userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data"

# Open a session on the main page
# (html_session() is called session() in rvest >= 1.0)
s <- html_session(mainUrl)

# Instrument id, stored in the pair_ids attribute of a div
pair_ids <- s %>%
    html_nodes("div[pair_ids]") %>%
    html_attr("pair_ids")

# Page heading, sent back as the "header" form field
header <- s %>% html_nodes(".instrumentHeader h2") %>% html_text()

# rvest:::request_POST is an unexported internal helper,
# so it may not be available in other rvest versions
resp <- s %>% rvest:::request_POST(
    "https://www.investing.com/instruments/HistoricalDataAjax",
    add_headers('X-Requested-With' = 'XMLHttpRequest'),
    user_agent(userAgent),
    body = list(
        curr_id = pair_ids,
        header = header[[1]],
        st_date = format(startDate, format = "%m/%d/%Y"),
        end_date = format(endDate, format = "%m/%d/%Y"),
        interval_sec = "Daily",
        sort_col = "date",
        sort_ord = "DESC",
        action = "historical_data"
    ),
    encode = "form") %>%
    html_table

# The first table in the response holds the historical data
print(resp[[1]])

Output:

            Date  Price   Open   High    Low Change %
1   Oct 09, 2020 -0.339 -0.338 -0.333 -0.361    2.42%
2   Oct 08, 2020 -0.331 -0.306 -0.306 -0.338    7.47%
3   Oct 07, 2020 -0.308 -0.323 -0.300 -0.324   -0.65%
4   Oct 06, 2020 -0.310 -0.288 -0.278 -0.319    7.27%
5   Oct 05, 2020 -0.289 -0.323 -0.278 -0.331  -10.39%
6   Oct 03, 2020 -0.322 -0.322 -0.322 -0.322    1.42%
7   Oct 02, 2020 -0.318 -0.311 -0.302 -0.320    5.65%
.....................................................
.....................................................
96  Jun 08, 2020 -0.162 -0.152 -0.133 -0.173   13.29%
97  Jun 05, 2020 -0.143 -0.129 -0.127 -0.154   13.49%
98  Jun 04, 2020 -0.126 -0.089 -0.063 -0.148   38.46%
99  Jun 03, 2020 -0.091 -0.120 -0.087 -0.128  -35.00%
100 Jun 02, 2020 -0.140 -0.148 -0.137 -0.166   14.75%
101 Jun 01, 2020 -0.122 -0.140 -0.101 -0.150  -17.57%
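
The Date and Change % columns in a table like this are text, so some cleanup is usually needed before plotting or analysis. The following is a minimal base R sketch, assuming the columns look like the output above; the two sample rows are copied from that output, and parse_investing_date and the cleaned column names are hypothetical names introduced here.

```r
# Two rows copied from the output above, typed by hand for illustration
raw <- data.frame(
    Date = c("Oct 09, 2020", "Jun 01, 2020"),
    Price = c(-0.339, -0.122),
    `Change %` = c("2.42%", "-17.57%"),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# Hypothetical helper: parse "Oct 09, 2020" style dates.
# The month is looked up in month.abb so the result does not
# depend on the locale of the R session.
parse_investing_date <- function(x) {
    month <- match(substr(x, 1, 3), month.abb)
    day <- as.integer(substr(x, 5, 6))
    year <- as.integer(substr(x, 9, 12))
    as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

clean <- data.frame(
    date = parse_investing_date(raw$Date),
    price = raw$Price,
    # Drop the trailing percent sign and convert to numeric
    change_pct = as.numeric(sub("%", "", raw[["Change %"]], fixed = TRUE))
)

# Oldest observation first
clean <- clean[order(clean$date), ]
print(clean)
```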

This also works for any page if you replace the value of the mainUrl variable.
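
If the same request is reused for several pages or date ranges, the form body can be built by a small helper. This is only a sketch: history_request_body is a hypothetical name, the field names are taken from the full code above, and the id and header passed in the usage line are placeholders rather than real values.

```r
# Hypothetical helper that assembles the form fields used in the POST above.
# pair_id and header are the values scraped from the instrument page.
history_request_body <- function(pair_id, header, start, end = Sys.Date()) {
    stopifnot(inherits(start, "Date"), inherits(end, "Date"), start <= end)
    list(
        curr_id = pair_id,
        header = header,
        # The endpoint is sent dates as month/day/year, as in the full code
        st_date = format(start, format = "%m/%d/%Y"),
        end_date = format(end, format = "%m/%d/%Y"),
        interval_sec = "Daily",
        sort_col = "date",
        sort_ord = "DESC",
        action = "historical_data"
    )
}

# Placeholder id and header, for illustration only
body <- history_request_body(
    pair_id = "12345",
    header = "Example Bond Historical Data",
    start = as.Date("2020-06-01"),
    end = as.Date("2020-10-09")
)
str(body)
```

The resulting list can then be passed as the body argument of the POST call, with encode = "form" as before.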
