rvest Webscraping in R with form inputs
Question
I can't get my head around this problem in R, and I would really appreciate any advice.
I am trying to scrape historical bond yield data from https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data for personal use only (of course).
The solution provided here works really well, but it only scrapes the first 24 time stamps of daily data: webscraping data tables and data from a web page
What I am trying to achieve is to change the date range in order to scrape more historical data.
According to the SelectorGadget tool, the date range input has the id widgetFieldDateRange (XPath: //*[(@id = "widgetFieldDateRange")]).
I have also tried the following lines of code to change the date values, but without success:
library(rvest)
url1 <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data" #Spain 5yr yield
session <- html_session(url1)
pgform <- html_form(session)[[1]]
pgform$fields[[3]]$value <- "01/01/2010 - 09/10/2020"
result <- submit_form(session, pgform)
Question: Any idea how to correctly submit the new date range and retrieve the extended time series?
Thank you very much for your help!
PS: Unfortunately, the URL does not change based on the date range.
Recommended answer
You can perform the POST request directly:
POST https://www.investing.com/instruments/HistoricalDataAjax
You need to scrape a few pieces of information from the page that are required in the request:
- the pair_ids attribute from a div tag
- the header value from the h2 tag inside the .instrumentHeader class
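As a rough illustration of those two lookups, here is a minimal sketch that runs the same selectors against an inline HTML snippet instead of the live page. The markup, the pair id "12345" and the header text are made-up placeholders, not values taken from investing.com; only the selectors come from the answer.

```r
library(rvest)

# Hypothetical, heavily simplified stand-in for the real page markup.
# The pair id and the header text below are placeholders.
html <- '
<html><body>
  <div class="instrumentHeader"><h2>Example Bond Yield Historical Data</h2></div>
  <div pair_ids="12345"></div>
</body></html>'

page <- read_html(html)

# The pair_ids attribute of the first div that carries one
pair_ids <- page %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# The text of the h2 inside the .instrumentHeader element
header <- page %>%
  html_nodes(".instrumentHeader h2") %>%
  html_text()

print(pair_ids)
print(header)
```

Against the real site the same two pipelines would be applied to the session object rather than to a parsed string.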
Full code:
library(rvest)
library(httr)

startDate <- as.Date("2020-06-01")
endDate <- Sys.Date() # today
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data"

# Load the page once to collect the values the AJAX endpoint expects.
# Note: html_session() is deprecated in favour of session() as of rvest 1.0.0.
s <- html_session(mainUrl)
pair_ids <- s %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")
header <- s %>% html_nodes(".instrumentHeader h2") %>% html_text()

# POST the form fields within the same session.
# rvest:::request_POST is an unexported internal function, so it may not be
# available in every rvest version.
resp <- s %>% rvest:::request_POST(
    "https://www.investing.com/instruments/HistoricalDataAjax",
    add_headers('X-Requested-With' = 'XMLHttpRequest'),
    user_agent(userAgent),
    body = list(
      curr_id = pair_ids,
      header = header[[1]],
      # dates are sent as month/day/year
      st_date = format(startDate, format = "%m/%d/%Y"),
      end_date = format(endDate, format = "%m/%d/%Y"),
      interval_sec = "Daily",
      sort_col = "date",
      sort_ord = "DESC",
      action = "historical_data"
    ),
    encode = "form") %>%
  html_table

print(resp[[1]])
Output:
Date Price Open High Low Change %
1 Oct 09, 2020 -0.339 -0.338 -0.333 -0.361 2.42%
2 Oct 08, 2020 -0.331 -0.306 -0.306 -0.338 7.47%
3 Oct 07, 2020 -0.308 -0.323 -0.300 -0.324 -0.65%
4 Oct 06, 2020 -0.310 -0.288 -0.278 -0.319 7.27%
5 Oct 05, 2020 -0.289 -0.323 -0.278 -0.331 -10.39%
6 Oct 03, 2020 -0.322 -0.322 -0.322 -0.322 1.42%
7 Oct 02, 2020 -0.318 -0.311 -0.302 -0.320 5.65%
.....................................................
.....................................................
96 Jun 08, 2020 -0.162 -0.152 -0.133 -0.173 13.29%
97 Jun 05, 2020 -0.143 -0.129 -0.127 -0.154 13.49%
98 Jun 04, 2020 -0.126 -0.089 -0.063 -0.148 38.46%
99 Jun 03, 2020 -0.091 -0.120 -0.087 -0.128 -35.00%
100 Jun 02, 2020 -0.140 -0.148 -0.137 -0.166 14.75%
101 Jun 01, 2020 -0.122 -0.140 -0.101 -0.150 -17.57%
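The table comes back with dates and percentage changes as text, so a little post-processing is probably needed before using it as a time series. The following is only a sketch: it assumes the Date column always looks like "Oct 09, 2020" and that the change column arrives as strings such as "2.42%". The two sample rows are copied from the output above, and the month lookup uses R's built-in month.abb so the parsing does not depend on the system locale.

```r
# Two rows copied from the output above, as they would plausibly arrive
# from html_table (character Date and "Change %" columns are an assumption).
raw <- data.frame(
  Date = c("Oct 09, 2020", "Oct 08, 2020"),
  Price = c(-0.339, -0.331),
  `Change %` = c("2.42%", "7.47%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Parse "Mon DD, YYYY" by position, mapping the month abbreviation
# through month.abb instead of relying on a locale-dependent %b.
month_num <- match(substr(raw$Date, 1, 3), month.abb)
day <- as.integer(substr(raw$Date, 5, 6))
year <- as.integer(substr(raw$Date, 9, 12))
raw$Date <- as.Date(sprintf("%04d-%02d-%02d", year, month_num, day))

# Strip the percent sign and convert to a fraction.
raw$`Change %` <- as.numeric(sub("%", "", raw$`Change %`, fixed = TRUE)) / 100

print(raw)
```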
This also works for any other page of this kind if you replace the value of the mainUrl variable.
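If you plan to reuse the request for several pages or date ranges, the form body can be built by a small helper. This is a sketch only: build_history_body is a hypothetical name, the field names are the ones used in the code above, and the pair id and header passed in the usage lines are placeholders. It needs nothing beyond base R.

```r
# Hypothetical helper: builds the form fields for the HistoricalDataAjax
# request from a pair id, a header string and a date range.
build_history_body <- function(pair_id, header, start_date, end_date) {
  list(
    curr_id = pair_id,
    header = header,
    # the endpoint is sent dates as month/day/year, as in the code above
    st_date = format(as.Date(start_date), "%m/%d/%Y"),
    end_date = format(as.Date(end_date), "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  )
}

# Placeholder pair id and header; the dates match the range from the question.
body <- build_history_body("12345", "Example Header", "2010-01-01", "2020-10-09")
str(body)
```

The resulting list can be passed as the body argument of the POST call with encode = "form". Writing the dates as ISO strings and formatting them in one place also avoids the day/month ambiguity of a value like "09/10/2020".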