Using the rvest package when an HTML table has two headers


Question

I am using the following code to scrape an HTML table on AFL player data:

library(rvest)

website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
table   <- website %>%
           html_nodes("table") %>%
           .[(1)] %>%
           html_table()

The resulting table is reported as 34 obs. of 27 variables; however, nrow(table) and ncol(table) both return NULL. Is this because there are two rows of headers in the data frame? I want to be able to do calculations based on individual columns, but the following gives an error:

table[,1]
# Error in table[, 1] : incorrect number of dimensions

Why does it produce this error and how can I solve it?

Recommended answer




library(rvest)
#> Loading required package: xml2

website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")

On this website, you have several tables, one per link displayed above the printed table on the main page. Using html_table() on the result of html_nodes("table") gives you all the tables at once, as a list of data frames.

all_tables <- website %>%
  html_nodes("table") %>%
  html_table()

str(all_tables, 1)
#> List of 23
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
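
The list structure shown above is also what explains the error in the question: calling html_table() on a set of nodes returns a list, even when only one table was selected, and a list has no rows or columns. Below is a minimal sketch of this, using a small inline HTML string (a made-up stand-in for the real page, so that no network access is needed):

```r
library(rvest)

# Hypothetical one-table page, standing in for the real website
page <- read_html(
  "<table>
     <tr><th>Player</th><th>R1</th></tr>
     <tr><td>Atkins, Rory</td><td>19</td></tr>
   </table>"
)

# html_table() on a node set returns a list of tables, not a data frame
tables <- page %>%
  html_nodes("table") %>%
  html_table()

nrow(tables)          # NULL: a list has no dimensions

# Extract the data frame itself with [[ ]] before indexing by column
first <- tables[[1]]
nrow(first)
first[, 1]
```

In the question's code, replacing `.[(1)]` with `.[[1]]` should have the same effect, since it passes a single node rather than a node set to html_table().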

You can then select the table you want, but the headers are still not right:

head(all_tables[[1]])
#>          Disposals Disposals Disposals Disposals Disposals Disposals
#> 1           Player        R1        R2        R3        R4        R5
#> 2     Atkins, Rory        19        19        19        23        29
#> 3  Beech, Jonathon                                                  
#> 4     Betts, Eddie        18        13        16        22        12
#> 5      Brown, Luke        18        12        13         9        15
#> 6 Cameron, Charlie        23        17        16        16        13
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1        R6        R7        R8        R9       R10       R11       R12
#> 2        23        20        21        28        37        14        25
#> 3                                                                    15
#> 4        16        13         9        16        14        12        11
#> 5        17        13        20        25        16        12          
#> 6        13        14        10        18        13         8        13
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1       R14       R15       R16       R17       R18       R19       R20
#> 2        28        15        23        18        19        16        16
#> 3        12        11                                                  
#> 4        14        11        13        16         8                  16
#> 5        10        15        14        17        11        10        20
#> 6        15                  10        20         6         9        17
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1       R21       R22       R23        QF        PF        GF       Tot
#> 2        27        21        21        16        22        17       536
#> 3                                                                    38
#> 4         7        16        12        13        13         7       318
#> 5        17        17         9        20        10        13       353
#> 6        13        10        10        15        19        16       334

Using some manipulation on the list and tables with purrr and dplyr, you can format your table which has 2 headers:

all_tables   <- website %>%
  html_nodes("table") %>%
  # do not let rvest handle the header automatically
  html_table(header = FALSE)

library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#> 
#>     pluck
library(dplyr)  # provides slice()
all_tables_res <- all_tables %>%
  # get the first column, first row to set the name for the list elements
  # pluck is a purrr function acting like x[[1]][1, 1] here
  lmap( ~ set_names(.x, nm = pluck(.x, 1, 1, 1))) %>%
  # For each table, set the second row as the header
  # (unlist() turns the one-row data frame into a plain character vector)
  # and delete the first and second rows
  map(~ set_names(.x, nm = unlist(.x[2, ], use.names = FALSE)) %>% slice(-c(1, 2)))
str(all_tables_res, 1)
#> List of 23
#>  $ Disposals              :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Kicks                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Marks                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Handballs              :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Goals                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Behinds                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Hit Outs               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Tackles                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Rebounds               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Inside 50s             :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Clearances             :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Clangers               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Frees                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Frees Against          :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Brownlow Votes         :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Contested Possessions  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Uncontested Possessions:Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Contested Marks        :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Marks Inside 50        :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ One Percenters         :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Bounces                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Goal Assists           :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ % Played               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:

You can now call up any table from the website by name.

head(all_tables_res$Goals)
#> # A tibble: 6 x 27
#>             Player    R1    R2    R3    R4    R5    R6    R7    R8    R9
#>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     Atkins, Rory     3     1     -     2     1     -     1     1     -
#> 2  Beech, Jonathon                                                      
#> 3     Betts, Eddie     4     3     3     6     3     1     3     2     3
#> 4      Brown, Luke     -     1     -     -     1     -     -     -     -
#> 5 Cameron, Charlie     2     1     -     1     2     2     2     -     4
#> 6     Crouch, Brad                             -     -     -     -     1
#> # ... with 17 more variables: R10 <chr>, R11 <chr>, R12 <chr>, R14 <chr>,
#> #   R15 <chr>, R16 <chr>, R17 <chr>, R18 <chr>, R19 <chr>, R20 <chr>,
#> #   R21 <chr>, R22 <chr>, R23 <chr>, QF <chr>, PF <chr>, GF <chr>,
#> #   Tot <chr>
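
Note that every column in the output above is of type <chr>, so the columns need converting before any calculation. A minimal sketch of one way to do that in base R follows. The toy data frame and the helper name to_number are hypothetical, and treating both blank cells and "-" as missing values is an assumption about what those cells mean on the site.

```r
# Hypothetical toy table mimicking the character columns shown above
goals <- data.frame(
  Player = c("Atkins, Rory", "Beech, Jonathon", "Betts, Eddie"),
  R1 = c("3", "", "4"),
  R2 = c("1", "", "3"),
  R3 = c("-", "", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blank and "-" cells become NA, everything else numeric
to_number <- function(x) {
  x[x %in% c("", "-")] <- NA
  as.numeric(x)
}

# Convert every column except the player name
round_cols <- setdiff(names(goals), "Player")
goals[round_cols] <- lapply(goals[round_cols], to_number)

# Calculations on individual columns now work
goals$Total <- rowSums(goals[round_cols], na.rm = TRUE)
goals
```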
