使用R中的rvest软件包从下拉列表提交表单后,从网页下载csv文件 [英] Download csv file from webpage after submitting form from dropdown using rvest package in R

查看:58
本文介绍了使用R中的rvest软件包从下拉列表提交表单后,从网页下载csv文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在一个网络抓取项目中,从该网页下载各种csv文件:

  • 最后,用不同的季度刷新表格后,如何下载CSV?是否可以直接从网站上读取csv而无需下载文件?

  • 谢谢!

    解决方案

    +1以使用开发人员工具.该工具/技能将很好地为您服务.

    您应该认真考虑使用API​​.但是,您可以为此同时使用 httr rvest (并且我确认这不违反站点规则):

     库(RVest)图书馆(httr)图书馆(tidyverse) 

    我们将首先获取页面,因为我们需要抓取弹出菜单数据:

      pg<-read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")qtr_nodes<-html_nodes(pg,"select [id ='quarter_one']选项")data_frame(qtr = html_text(qtr_nodes),值= html_attr(qtr_nodes,值"))%&%;%filter(!grepl("ubscri",qtr))->qtrsqtrs###小标题:10 x 2## qtr值##< chr>< chr>## 1电流组合13F/13D/G -1## 2 2017年第三季度13F备案67## 3 2017年第二季度13F申请66## 2017年第一季度4 13F备案65## 5 2016年第四季度13F申请64## 6 2016年第三季度13F备案63## 7 Q2 2016 13F备案62## 8 2016年第一季度13F备案61## 9 2015年第4季度13F备案60## 2015年第三季度10F备案59 

    ^^是从漂亮名称到弹出窗口值的转换表.该值对于提交在后台发生的XHR请求是必需的.

    让我们创建一个函数来模拟该XHR请求:

      get_qtr<-函数(qtr){得到(url ="https://whalewisdom.com/filer/holdings",httr :: add_headers(主机="whalewisdom.com",`User-Agent` ="Mozilla/5.0(Macintosh; Intel Mac OS X 10.13; rv:58.0)Gecko/20100101 Firefox/58.0",接受="application/json,text/javascript,*/*; q = 0.01",`Accept-Language` ="en-US,en; q = 0.5",Referer ="https://whalewisdom.com/filer/blue-harbour-group-lp",`X-Requested-With` ="XMLHttpRequest",连接=保持活动"),查询=列表(q1 = qtr,id ="384",type_filter ="1,2,3,4",符号=",change_filter ="1,2,3,4,5",minimum_ranking =",minimum_shares =",is_etf ="0",sc ="true",`_search` ="false",行="25",页面="1",sidx =当前排名",sord ="asc"))->资源stop_for_status(res)res<-内容(res)map_df(res $ rows,〜map(.x,〜ifelse(is.null(.x),NA,.x)))} 

    我们只是将 value 传递给 qtr 参数,但是您也可以为其他位添加参数.

    现在,使用上面的转换表来获取随机选择的数据集:

      qtr_65<-get_qtr(65)一瞥(qtr_65)##观察结果:19##变量:24## $ id< lgl>NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA## $名称< chr>投资者银行股份有限公司","Xilinx公司","BWX技术股份有限公司","AGCO公司","A ...## $符号< chr>"ISBC","XLNX","BWXT","AGCO","AVT","WBMD","AKAM","RDC","FFIV","ADNT","...## $永久链接< chr>"isbc","xlnx","bwxt","agco","avt","wbmd","akam","rdc","ffiv","adnt","...## $ security_type< chr>"SH","SH","SH","SH","SH","SH","SH","SH","SH","SH","SH","SH","SH"," ...## $ stock_id< int>5284、930、7803、600、375、838、3527、3658、26、198034、5045、72934、812、4116,...## $ source_date< chr>",",",",",",",",,",,",,",,",,",,",",","## $ source_type< chr>",",",",",",",",,",,",,",,",,",,",",","## $扇区< chr>金融",信息技术",工业",工业",信息...## $ industry< chr>"TRUSTS& THRIFTS","SEMICONDUCTORS","Electrical Equipment","MACHINERY","ELE ...## $ current_shares< int>29582428、6058693、5287927、3813700、4363874、3361336、2855493、10542812、93835 ...## $ previous_shares< int>29582428、7514437、10561086、6835700、5415074、1795914、2474193、10542812、8599 ...## $ shares_change< int>0,-1455744,-5273159,-3022000,-1051200,1565422,381300,0,78373,1675570,...## $ position_change_type< chr>不适用,减少",减少",减少",减少",加法",加法",...## $ percent_shares_change< chr>"0.0",-19.3726",-49.9301",-44.2091",-19.4125","87.1658","15.4111","0 ...## $ current_ranking< int>1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,999999,999999,999999## $ previous_ranking< int>3、1、2、4、5、11、7、6、8、999999、10、9、999999、12、14、13、15、16、17## $ current_percent_of_portfolio< dbl>15.3264、12.6366、9.0686、8.2689、7.1946、6.3798、6.1419、5.9180、4.8200、4.387 ...## $ previous_percent_of_portfolio< dbl>13.7987、15.1687、14.0194、13.2249、8.6205、2.9767、5.5165、6.6592、4.1615,NA,...## $ current_mv< chr>"425395000.0","350738000.0","251705000.0","229508000.0","199691000.0","177 ...## $ previous_mv< chr>"412675000.0","453647000.0","419275000.0","395514000.0","257812000.0","890 ...## $ percent_ownership< chr>"26.3298285","2.4338473","5.3287767","4.5501506","3.3856140","8.6721677",...## $ Quarter_first_owned< chr>"2014年第1季度","2015年第1季度","2013年第4季度","2014年第2季度","2015年第2季度","2016年第4季度","2016年第3季度","Q ...## $ quarter_id_owned< int>53、57、52、54、58、64、63、53、61、65、47、63、65、64、64、64、64、60、64 

    我不知道^^是否是CSV中的内容,因为我没有注册帐户,但是您可以验证并希望对其进行修改.

    I am working on a webscraping project to download various csv files from this webpage: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link

    I would like to be able to programmatically choose the various reported quarters on the drop down list, hit submit (note that the URL for the page doesnt change for each different quarter) and then "Download CSV" for each of the quarters.

    As a disclaimer, I am a novice to rvest and below is my attempt at the solution:

    1. I first checked this site and found a relevant post Using r to navigate and scrape a webpage with drop down html forms

    2. It looks like they use the following code to get a form for what the inputs need to be to refresh an HTML table:

      pgsession <- html_session(url)
      pgform <-html_form(pgsession)[[3]]
      filled_form <-set_values(pgform,
              "team" = "ALL",
              "week" = "1",
              "pos"  = "ALL",
              "year" = "2015"      
       )
      
       submit_form(session=pgsession,form=filled_form, POST=url)
      

    3. I tried doing that for the site above and I get the following instead

       > html_form(html_session("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link"))
       [[1]]
       <form> '<unnamed>' (GET )
        <input text> '': 
        <select> '' [1/7]
      
       [[2]]
       <form> 'frm_registration' (POST /filer/registration)
        <input hidden> 'permalink': blue-harbour-group-lp
        <input hidden> 'registration_type': register
        <input text> 'user_email': 
      
       [[3]]
       <form> 'frm-report-error' (POST /filer/report_error)
        <input hidden> 'permalink': blue-harbour-group-lp
        <input text> 'user_name': 
        <input text> 'user_email': 
        <textarea> 'comments' [0 char]
        <textarea> 'g-recaptcha-response' [0 char]
      

    4. I dont quite see the same set up and the only form it seems with a drop down option is [1] with [1/7] options, but I dont know what that is referring to.

    5. Comparing the source code for both sites, it seems like I have a "form-control" class that I should be extracting? How do I do that?

    6. Finally, after refreshing the table with a different quarter, how do I download the CSV? Is it possible to read the csv from the website directly without downloading the file?

    Thanks!

    解决方案

    +1 for using Developer Tools. That tool/skill will serve you well.

    You should seriously consider using the API. But you can use httr and rvest together for this (and I verified it's not against the site rules):

    library(rvest)
    library(httr)
    library(tidyverse)
    

    We'll get the page first since we need to scrape the popup menu data:

    pg <- read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")
    
    qtr_nodes <- html_nodes(pg, "select[id='quarter_one'] option")
    
    data_frame(
      qtr = html_text(qtr_nodes),
      value = html_attr(qtr_nodes, "value")
    ) %>% 
      filter(!grepl("ubscri", qtr)) -> qtrs
    
    qtrs
    ## # A tibble: 10 x 2
    ##                           qtr value
    ##                         <chr> <chr>
    ##  1 Current Combined 13F/13D/G    -1
    ##  2       Q3 2017 13F Filings     67
    ##  3       Q2 2017 13F Filings     66
    ##  4       Q1 2017 13F Filings     65
    ##  5       Q4 2016 13F Filings     64
    ##  6       Q3 2016 13F Filings     63
    ##  7       Q2 2016 13F Filings     62
    ##  8       Q1 2016 13F Filings     61
    ##  9       Q4 2015 13F Filings     60
    ## 10       Q3 2015 13F Filings     59
    

    ^^ is a translation table from pretty name to the popup value. The value is necessary for submitting the XHR request that happens behind the scenes.

    Let's make a function to simulate that XHR request:

    get_qtr <- function(qtr) {
    
      GET(
        url = "https://whalewisdom.com/filer/holdings", 
        httr::add_headers(
          Host = "whalewisdom.com", 
          `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0", 
          Accept = "application/json, text/javascript, */*; q=0.01", 
          `Accept-Language` = "en-US,en;q=0.5", 
          Referer = "https://whalewisdom.com/filer/blue-harbour-group-lp", 
          `X-Requested-With` = "XMLHttpRequest", 
          Connection = "keep-alive"
        ),
        query = list(
          q1 = qtr, 
          id = "384", type_filter = "1,2,3,4", symbol = "", 
          change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "", 
          is_etf = "0", sc = "true", `_search` = "false", rows = "25", 
          page = "1", sidx = "current_ranking", sord = "asc"
        )
      ) -> res
    
      stop_for_status(res)
    
      res <- content(res)
    
      map_df(res$rows, ~map(.x, ~ifelse(is.null(.x), NA, .x)))
    
    }
    

    We're just passing in value to the qtr parameter, but you could add params for the other bits, too.

    Now, use the translation table above to get a randomly chosen set of data:

    qtr_65 <- get_qtr(65)
    
    glimpse(qtr_65)
    ## Observations: 19
    ## Variables: 24
    ## $ id                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ## $ name                          <chr> "Investors Bancorp Inc", "Xilinx, Inc", "BWX TECHNOLOGIES INC", "AGCO Corp", "A...
    ## $ symbol                        <chr> "ISBC", "XLNX", "BWXT", "AGCO", "AVT", "WBMD", "AKAM", "RDC", "FFIV", "ADNT", "...
    ## $ permalink                     <chr> "isbc", "xlnx", "bwxt", "agco", "avt", "wbmd", "akam", "rdc", "ffiv", "adnt", "...
    ## $ security_type                 <chr> "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "...
    ## $ stock_id                      <int> 5284, 930, 7803, 600, 375, 838, 3527, 3658, 26, 198034, 5045, 72934, 812, 4116,...
    ## $ source_date                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
    ## $ source_type                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
    ## $ sector                        <chr> "FINANCE", "INFORMATION TECHNOLOGY", "INDUSTRIALS", "INDUSTRIALS", "INFORMATION...
    ## $ industry                      <chr> "TRUSTS & THRIFTS", "SEMICONDUCTORS", "ELECTRICAL EQUIPMENT", "MACHINERY", "ELE...
    ## $ current_shares                <int> 29582428, 6058693, 5287927, 3813700, 4363874, 3361336, 2855493, 10542812, 93835...
    ## $ previous_shares               <int> 29582428, 7514437, 10561086, 6835700, 5415074, 1795914, 2474193, 10542812, 8599...
    ## $ shares_change                 <int> 0, -1455744, -5273159, -3022000, -1051200, 1565422, 381300, 0, 78373, 1675570, ...
    ## $ position_change_type          <chr> NA, "reduction", "reduction", "reduction", "reduction", "addition", "addition",...
    ## $ percent_shares_change         <chr> "0.0", "-19.3726", "-49.9301", "-44.2091", "-19.4125", "87.1658", "15.4111", "0...
    ## $ current_ranking               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 999999, 999999, 999999
    ## $ previous_ranking              <int> 3, 1, 2, 4, 5, 11, 7, 6, 8, 999999, 10, 9, 999999, 12, 14, 13, 15, 16, 17
    ## $ current_percent_of_portfolio  <dbl> 15.3264, 12.6366, 9.0686, 8.2689, 7.1946, 6.3798, 6.1419, 5.9180, 4.8200, 4.387...
    ## $ previous_percent_of_portfolio <dbl> 13.7987, 15.1687, 14.0194, 13.2249, 8.6205, 2.9767, 5.5165, 6.6592, 4.1615, NA,...
    ## $ current_mv                    <chr> "425395000.0", "350738000.0", "251705000.0", "229508000.0", "199691000.0", "177...
    ## $ previous_mv                   <chr> "412675000.0", "453647000.0", "419275000.0", "395514000.0", "257812000.0", "890...
    ## $ percent_ownership             <chr> "26.3298285", "2.4338473", "5.3287767", "4.5501506", "3.3856140", "8.6721677", ...
    ## $ quarter_first_owned           <chr> "Q1 2014", "Q1 2015", "Q4 2013", "Q2 2014", "Q2 2015", "Q4 2016", "Q3 2016", "Q...
    ## $ quarter_id_owned              <int> 53, 57, 52, 54, 58, 64, 63, 53, 61, 65, 47, 63, 65, 64, 64, 64, 64, 60, 64
    

    I have no idea if ^^ is what's in the CSV since I'm not registering for an account, but you can verify and hopefully modify.

    这篇关于使用R中的rvest软件包从下拉列表提交表单后,从网页下载csv文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆