Download csv file from webpage after submitting form from dropdown using rvest package in R
Problem description
I am working on a web scraping project to download various CSV files from this webpage: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link
I would like to be able to programmatically choose each of the reported quarters in the drop-down list, hit submit (note that the URL for the page doesn't change for each different quarter), and then click "Download CSV" for each of the quarters.
As a disclaimer, I am a novice to rvest and below is my attempt at the solution:
I first checked this site and found a relevant post, "Using r to navigate and scrape a webpage with drop down html forms".
It looks like they use the following code to get a form for what the inputs need to be to refresh an HTML table:
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]
filled_form <- set_values(pgform,
  "team" = "ALL",
  "week" = "1",
  "pos" = "ALL",
  "year" = "2015"
)
submit_form(session = pgsession, form = filled_form, POST = url)
I tried doing that for the site above and got the following instead:
> html_form(html_session("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link"))
[[1]]
<form> '<unnamed>' (GET )
  <input text> '':
  <select> '' [1/7]

[[2]]
<form> 'frm_registration' (POST /filer/registration)
  <input hidden> 'permalink': blue-harbour-group-lp
  <input hidden> 'registration_type': register
  <input text> 'user_email':

[[3]]
<form> 'frm-report-error' (POST /filer/report_error)
  <input hidden> 'permalink': blue-harbour-group-lp
  <input text> 'user_name':
  <input text> 'user_email':
  <textarea> 'comments' [0 char]
  <textarea> 'g-recaptcha-response' [0 char]
I don't see the same setup here. The only form that seems to have a drop-down is [[1]], which shows [1/7] options, but I don't know what that refers to.
Comparing the source code of the two sites, it looks like there is a "form-control" class that I should be extracting. How do I do that?
Finally, after refreshing the table with a different quarter, how do I download the CSV? Is it possible to read the csv from the website directly without downloading the file?
Thanks!
+1 for using Developer Tools. That tool/skill will serve you well.
You should seriously consider using the API. But you can use httr and rvest together for this (and I verified it's not against the site rules):
library(rvest)
library(httr)
library(tidyverse)
We'll get the page first since we need to scrape the popup menu data:
pg <- read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")
qtr_nodes <- html_nodes(pg, "select[id='quarter_one'] option")
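# tibble() replaces data_frame(), which is deprecated in current tibble releases
tibble(
  qtr = html_text(qtr_nodes),
  value = html_attr(qtr_nodes, "value")
) %>%
  filter(!grepl("ubscri", qtr)) -> qtrs  # drop options whose label matches "ubscri"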
qtrs
## # A tibble: 10 x 2
##    qtr                        value
##    <chr>                      <chr>
##  1 Current Combined 13F/13D/G -1
##  2 Q3 2017 13F Filings        67
##  3 Q2 2017 13F Filings        66
##  4 Q1 2017 13F Filings        65
##  5 Q4 2016 13F Filings        64
##  6 Q3 2016 13F Filings        63
##  7 Q2 2016 13F Filings        62
##  8 Q1 2016 13F Filings        61
##  9 Q4 2015 13F Filings        60
## 10 Q3 2015 13F Filings        59
^^ is a translation table from pretty name to the popup value. The value is necessary for submitting the XHR request that happens behind the scenes.
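To make the lookup concrete, here is a minimal sketch that turns a pretty name into the value the request needs. It uses a small base R copy of the first four rows shown above so that it runs without hitting the site; `qtrs_demo` and `qtr_value()` are hypothetical names introduced only for this illustration.

```r
# Lookup table copied from the first four rows of the output above.
qtrs_demo <- data.frame(
  qtr = c("Current Combined 13F/13D/G", "Q3 2017 13F Filings",
          "Q2 2017 13F Filings", "Q1 2017 13F Filings"),
  value = c("-1", "67", "66", "65"),
  stringsAsFactors = FALSE
)

# qtr_value() is a hypothetical helper, not part of rvest or httr.
# It returns the dropdown value for an exact label and fails loudly
# when the label is missing or ambiguous.
qtr_value <- function(label, table = qtrs_demo) {
  hit <- table$value[table$qtr == label]
  if (length(hit) != 1) {
    stop("Expected exactly one match for '", label, "', found ", length(hit))
  }
  hit
}

qtr_value("Q1 2017 13F Filings")
```

The same function should work on the scraped `qtrs` table by passing it as the `table` argument, since it only relies on the `qtr` and `value` columns.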
Let's make a function to simulate that XHR request:
get_qtr <- function(qtr) {
  # Replay the XHR request the page makes when a quarter is selected
  GET(
    url = "https://whalewisdom.com/filer/holdings",
    httr::add_headers(
      Host = "whalewisdom.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0",
      Accept = "application/json, text/javascript, */*; q=0.01",
      `Accept-Language` = "en-US,en;q=0.5",
      Referer = "https://whalewisdom.com/filer/blue-harbour-group-lp",
      `X-Requested-With` = "XMLHttpRequest",
      Connection = "keep-alive"
    ),
    query = list(
      q1 = qtr,  # the dropdown value from the translation table
      id = "384", type_filter = "1,2,3,4", symbol = "",
      change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "",
      is_etf = "0", sc = "true", `_search` = "false", rows = "25",
      page = "1", sidx = "current_ranking", sord = "asc"
    )
  ) -> res
  stop_for_status(res)  # error out on any HTTP failure
  res <- content(res)   # parse the JSON body into a list
  # Turn NULL fields into NA so each row binds cleanly into a data frame
  map_df(res$rows, ~map(.x, ~ifelse(is.null(.x), NA, .x)))
}
We're just passing in value to the qtr parameter, but you could add params for the other bits, too.
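As a rough sketch of what "params for the other bits" could look like, the query list can be built by a small function with defaults matching the values used above. `build_holdings_query()` is a hypothetical helper; the parameter names come from the query list in get_qtr, but which other values the server accepts for them is an assumption that has not been checked here.

```r
# build_holdings_query() is a hypothetical helper. Parameter names are taken
# from the query list in get_qtr(); the accepted value ranges are an assumption.
build_holdings_query <- function(qtr, id = "384", rows = 25, page = 1,
                                 sidx = "current_ranking", sord = "asc") {
  list(
    q1 = as.character(qtr),
    id = as.character(id),
    type_filter = "1,2,3,4", symbol = "",
    change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "",
    is_etf = "0", sc = "true", `_search` = "false",
    rows = as.character(rows),   # rows per page
    page = as.character(page),   # page number
    sidx = sidx, sord = sord     # sort column and direction
  )
}

# Example: second page of quarter 65 with 50 rows per page
q <- build_holdings_query(65, rows = 50, page = 2)
str(q[c("q1", "rows", "page")])
```

The result could then be passed as `query = build_holdings_query(qtr, ...)` in the GET() call instead of the hard-coded list.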
Now, use the translation table above to get a randomly chosen set of data:
qtr_65 <- get_qtr(65)
glimpse(qtr_65)
## Observations: 19
## Variables: 24
## $ id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ name <chr> "Investors Bancorp Inc", "Xilinx, Inc", "BWX TECHNOLOGIES INC", "AGCO Corp", "A...
## $ symbol <chr> "ISBC", "XLNX", "BWXT", "AGCO", "AVT", "WBMD", "AKAM", "RDC", "FFIV", "ADNT", "...
## $ permalink <chr> "isbc", "xlnx", "bwxt", "agco", "avt", "wbmd", "akam", "rdc", "ffiv", "adnt", "...
## $ security_type <chr> "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "...
## $ stock_id <int> 5284, 930, 7803, 600, 375, 838, 3527, 3658, 26, 198034, 5045, 72934, 812, 4116,...
## $ source_date <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ source_type <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ sector <chr> "FINANCE", "INFORMATION TECHNOLOGY", "INDUSTRIALS", "INDUSTRIALS", "INFORMATION...
## $ industry <chr> "TRUSTS & THRIFTS", "SEMICONDUCTORS", "ELECTRICAL EQUIPMENT", "MACHINERY", "ELE...
## $ current_shares <int> 29582428, 6058693, 5287927, 3813700, 4363874, 3361336, 2855493, 10542812, 93835...
## $ previous_shares <int> 29582428, 7514437, 10561086, 6835700, 5415074, 1795914, 2474193, 10542812, 8599...
## $ shares_change <int> 0, -1455744, -5273159, -3022000, -1051200, 1565422, 381300, 0, 78373, 1675570, ...
## $ position_change_type <chr> NA, "reduction", "reduction", "reduction", "reduction", "addition", "addition",...
## $ percent_shares_change <chr> "0.0", "-19.3726", "-49.9301", "-44.2091", "-19.4125", "87.1658", "15.4111", "0...
## $ current_ranking <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 999999, 999999, 999999
## $ previous_ranking <int> 3, 1, 2, 4, 5, 11, 7, 6, 8, 999999, 10, 9, 999999, 12, 14, 13, 15, 16, 17
## $ current_percent_of_portfolio <dbl> 15.3264, 12.6366, 9.0686, 8.2689, 7.1946, 6.3798, 6.1419, 5.9180, 4.8200, 4.387...
## $ previous_percent_of_portfolio <dbl> 13.7987, 15.1687, 14.0194, 13.2249, 8.6205, 2.9767, 5.5165, 6.6592, 4.1615, NA,...
## $ current_mv <chr> "425395000.0", "350738000.0", "251705000.0", "229508000.0", "199691000.0", "177...
## $ previous_mv <chr> "412675000.0", "453647000.0", "419275000.0", "395514000.0", "257812000.0", "890...
## $ percent_ownership <chr> "26.3298285", "2.4338473", "5.3287767", "4.5501506", "3.3856140", "8.6721677", ...
## $ quarter_first_owned <chr> "Q1 2014", "Q1 2015", "Q4 2013", "Q2 2014", "Q2 2015", "Q4 2016", "Q3 2016", "Q...
## $ quarter_id_owned <int> 53, 57, 52, 54, 58, 64, 63, 53, 61, 65, 47, 63, 65, 64, 64, 64, 64, 60, 64
I have no idea if ^^ is what's in the CSV since I'm not registering for an account, but you can verify and hopefully modify.
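If the returned data is what you need, one way to get a CSV per quarter is to loop over the values and write each result to disk. The sketch below is only an illustration: `save_quarters()` and `fake_fetch()` are hypothetical names, the file naming scheme is an assumption, and the demonstration uses a stand-in fetcher so that it runs offline.

```r
# save_quarters() is a hypothetical helper. `fetch` is any function that takes
# one quarter value and returns a data frame (get_qtr above would fit).
save_quarters <- function(values, fetch, dir = tempdir(), pause = 1) {
  paths <- character(0)
  for (v in values) {
    df <- fetch(v)
    # File naming is an arbitrary choice for this sketch
    path <- file.path(dir, paste0("holdings_q", v, ".csv"))
    write.csv(df, path, row.names = FALSE)
    paths <- c(paths, path)
    Sys.sleep(pause)  # wait between requests to go easy on the server
  }
  paths
}

# Offline demonstration with a stand-in fetcher instead of a real request
fake_fetch <- function(v) {
  data.frame(quarter = v, symbol = c("ISBC", "XLNX"), stringsAsFactors = FALSE)
}
files <- save_quarters(c(64, 65), fake_fetch, pause = 0)
read.csv(files[1])
```

With the real function, the call would look like `save_quarters(qtrs$value, get_qtr)`. Keeping the pause above zero is a courtesy to the site rather than a documented requirement.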