Unable to scrape website with form using rvest

Problem description

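I am trying to scrape the website listed below. I attempted this with rvest, using the code below.

My approach was to replicate the PUT request that I found in Google Chrome for the download button. I'm not sure what I'm doing wrong. The error is shown in my reprex.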

  library(httr)
  library(rvest)
  library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

  
  
  url <- "https://nfc.shgn.com/adp/baseball"
  pgsession <- session(url)
  
  pgform <- html_form(pgsession)[[2]]

  filled_form <- html_form_set(pgform,
                            team_id = "0", from_date = "2020-10-01", to_date = "2021-02-19", num_teams = "0",
                            draft_type = "0", sport = "baseball", position = "",
                            league_teams = "0" )
#> Warning: Setting value of hidden field 'team_id'.
#> Warning: Setting value of hidden field 'from_date'.
#> Warning: Setting value of hidden field 'to_date'.
#> Warning: Setting value of hidden field 'num_teams'.
#> Warning: Setting value of hidden field 'draft_type'.
#> Warning: Setting value of hidden field 'sport'.
#> Warning: Setting value of hidden field 'position'.
#> Warning: Setting value of hidden field 'league_teams'.
  
  session_submit(x = pgsession, form = filled_form)
#> Error: `form` doesn't contain a `action` attribute

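Recommended answer

If you only want to scrape the table, you can do so easily with rvest and purrr by using the URL that the Print button takes you to.

Although you cannot use html_table here, extracting the cells into a data frame with purrr::map_df is straightforward: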

library(rvest)
library(dplyr)
library(purrr)
library(stringr)

pgtab <- read_html("https://nfc.shgn.com/adp.data.php") %>%  #destination of Print button
  html_nodes("tr") %>%                 #returns a list of row nodes
  map_df(~html_nodes(., "td") %>%      #returns a list of cell nodes for each row
           html_text() %>%             #extract text
           str_trim() %>%              #remove whitespace
           set_names("Rank","Player","Team","Position","ADP","MinPick",
                     "MaxPick","Diff","Picks","Team2","PickBid"))

head(pgtab)

# A tibble: 6 x 11
  Rank  Player             Team  Position ADP   MinPick MaxPick Diff  Picks Team2 PickBid
  <chr> <chr>              <chr> <chr>    <chr> <chr>   <chr>   <chr> <chr> <chr> <chr>  
1 1     Ronald Acuna Jr.   ATL   OF       1.69  1       6       ""    332   ""    ""     
2 2     Fernando Tatis Jr. SD    SS       2.57  1       7       ""    332   ""    ""     
3 3     Mookie Betts       LAD   OF       3.53  1       9       ""    332   ""    ""     
4 4     Juan Soto          WAS   OF       3.98  1       10      ""    332   ""    ""     
5 5     Mike Trout         LAA   OF       6.08  1       11      ""    332   ""    ""     
6 6     Gerrit Cole        NYY   P        6.50  1       15      ""    332   ""    ""     
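
Every column in the printed tibble is `<chr>`, so a numeric conversion is probably wanted before any analysis. Below is a minimal offline sketch of the same row-by-row extraction followed by that conversion. The two-row table, its markup, and the player names are made-up placeholders rather than the site's real HTML, and only three of the eleven columns are shown.

```r
library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical stand-in for the downloaded page (not the site's real markup)
page <- read_html("<table>
  <tr><td> 1 </td><td>Player A</td><td> 1.69 </td></tr>
  <tr><td> 2 </td><td>Player B</td><td> 2.57 </td></tr>
</table>")

# Same technique as above: one named character vector per row,
# bound together into a data frame by map_df
tab <- page %>%
  html_nodes("tr") %>%
  map_df(~ html_nodes(., "td") %>%
           html_text() %>%
           str_trim() %>%
           set_names(c("Rank", "Player", "ADP")))

# Convert the text columns that hold numbers
tab$Rank <- as.integer(tab$Rank)
tab$ADP  <- as.numeric(tab$ADP)

str(tab)
```

The same two conversion lines can be applied to pgtab for columns such as Rank, ADP, MinPick, MaxPick and Picks, assuming those cells contain only digits and a decimal point.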

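You can also set the form parameters and submit the form, although you will need to check whether doing so changes the result. Here is one way to do it: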

url <- "https://nfc.shgn.com/adp/baseball"
pgsession <- html_session(url)

pgform <- html_form(pgsession)[[2]]

filled_form <- set_values(pgform,
                         team_id = "0", from_date = "2020-10-01", to_date = "2021-02-19", num_teams = "0",
                         draft_type = "0", sport = "baseball", position = "",
                         league_teams = "0" )

filled_form$url <- "https://nfc.shgn.com/adp.data.php" #error if this is left blank

pgsession <- submit_form(pgsession, filled_form, submit = "printerFriendly")

pgtab <- pgsession %>% read_html() %>% #code as per previous answer above
  html_nodes("tr") %>% 
  map_df(~html_nodes(., "td") %>% 
           html_text() %>% 
           str_trim() %>% 
           set_names("Rank","Player","Team","Position","ADP","MinPick",
                     "MaxPick","Diff","Picks","Team2","PickBid"))
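
The form code in this answer uses the pre-1.0 rvest names (html_session, set_values, submit_form), whereas the question uses the rvest 1.0 names (session, html_form_set, session_submit). The following is a sketch of the same idea with the 1.0 names. It assumes that the form's target is stored in a field called `action`, which is what the error message in the question suggests, rather than in `url`. It has not been verified against the live site, and `submit_adp_form` is a hypothetical helper name.

```r
library(rvest)

# Hypothetical helper: nothing is requested until the function is called
submit_adp_form <- function() {
  pgsession <- session("https://nfc.shgn.com/adp/baseball")
  pgform <- html_form(pgsession)[[2]]

  # Setting hidden fields produces warnings, as shown in the question
  filled_form <- html_form_set(pgform,
                               team_id = "0", from_date = "2020-10-01",
                               to_date = "2021-02-19", num_teams = "0",
                               draft_type = "0", sport = "baseball",
                               position = "", league_teams = "0")

  # Assumption: rvest >= 1.0 reads the submission target from `action`
  filled_form$action <- "https://nfc.shgn.com/adp.data.php"

  # Returns the session after submitting via the printerFriendly button
  session_submit(pgsession, filled_form, submit = "printerFriendly")
}

# Usage (requires network access):
# pgsession <- submit_adp_form()
# read_html(pgsession) can then be parsed row by row as shown earlier.
```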
