Using the rvest package when an HTML table has two headers
Question
I am using the following code to scrape an HTML table on AFL player data:
library(rvest)
website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
table <- website %>%
html_nodes("table") %>%
.[(1)] %>%
html_table()
The resulting table is reported as 34 obs. of 27 variables, but nrow(table) and ncol(table) both return NULL. Is this because there are two rows of headers in the data frame? I want to be able to do calculations on individual columns, but the following gives an error:
table[,1]
# Error in table[, 1] : incorrect number of dimensions
Why does it produce this error, and how can I solve it?
Recommended answer
library(rvest)
#> Loading required package: xml2
website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
This page contains several tables, one for each link displayed above the printed table on the main page.
Calling html_table() on the result of html_nodes("table") returns all of the tables in a list at once.
all_tables <- website %>%
html_nodes("table") %>%
html_table()
str(all_tables, 1)
#> List of 23
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
#> $ :'data.frame': 34 obs. of 27 variables:
You can then select the table you want, but the headers are still not right:
head(all_tables[[1]])
#> Disposals Disposals Disposals Disposals Disposals Disposals
#> 1 Player R1 R2 R3 R4 R5
#> 2 Atkins, Rory 19 19 19 23 29
#> 3 Beech, Jonathon
#> 4 Betts, Eddie 18 13 16 22 12
#> 5 Brown, Luke 18 12 13 9 15
#> 6 Cameron, Charlie 23 17 16 16 13
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1 R6 R7 R8 R9 R10 R11 R12
#> 2 23 20 21 28 37 14 25
#> 3 15
#> 4 16 13 9 16 14 12 11
#> 5 17 13 20 25 16 12
#> 6 13 14 10 18 13 8 13
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1 R14 R15 R16 R17 R18 R19 R20
#> 2 28 15 23 18 19 16 16
#> 3 12 11
#> 4 14 11 13 16 8 16
#> 5 10 15 14 17 11 10 20
#> 6 15 10 20 6 9 17
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1 R21 R22 R23 QF PF GF Tot
#> 2 27 21 21 16 22 17 536
#> 3 38
#> 4 7 16 12 13 13 7 318
#> 5 17 17 9 20 10 13 353
#> 6 13 10 10 15 19 16 334
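This also accounts for the NULL results in the question: .[(1)] uses single brackets, which return a list of length one rather than the data frame itself, so nrow(), ncol() and table[, 1] are being applied to a list. A minimal sketch with a toy list follows; the tables object and its values are made up for illustration and are not taken from the site.

```r
# Toy stand-in for the list returned by html_table(); values are illustrative only
tables <- list(
  data.frame(Player = c("A", "B"), R1 = c(19, 18)),
  data.frame(Player = c("A", "B"), R1 = c(3, 4))
)

one_as_list <- tables[1]    # single brackets: still a list, of length 1
one_as_df   <- tables[[1]]  # double brackets: the data frame itself

class(one_as_list)  # "list"
nrow(one_as_list)   # NULL, because a plain list has no dim attribute
nrow(one_as_df)     # 2
one_as_df[, 1]      # column indexing works on the data frame
```

Replacing .[(1)] with .[[1]] in the original pipeline should therefore be enough to get a data frame back, although the header problem shown above remains.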
With some manipulation of the list and its tables using purrr and dplyr, you can format tables that have two header rows:
all_tables <- website %>%
html_nodes("table") %>%
# do not let rvest handle the header automatically
html_table(header = FALSE)
library(purrr)
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#>
#> pluck
# dplyr provides slice(), used below
library(dplyr)
all_tables_res <- all_tables %>%
# get the first column, first row to set the name for the list elements
# pluck is a purrr function acting like x[[1]][1, 1] here
lmap( ~ set_names(.x, nm = pluck(.x, 1, 1, 1))) %>%
# For each table, set the second row as the header (as a plain character vector)
# and delete the first and second rows
map(~ set_names(.x, nm = as.character(unlist(.x[2, ]))) %>% slice(-c(1, 2)))
str(all_tables_res, 1)
#> List of 23
#> $ Disposals :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Kicks :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Marks :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Handballs :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Goals :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Behinds :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Hit Outs :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Tackles :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Rebounds :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Inside 50s :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Clearances :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Clangers :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Frees :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Frees Against :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Brownlow Votes :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Contested Possessions :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Uncontested Possessions:Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Contested Marks :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Marks Inside 50 :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ One Percenters :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Bounces :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ Goal Assists :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
#> $ % Played :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables:
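The header-promotion step can also be tried offline. The following is a rough base R sketch of the same idea on a toy table, assuming the shape produced by html_table(header = FALSE); the raw object and its values are invented for illustration.

```r
# Toy table mimicking html_table(header = FALSE): row 1 repeats the statistic
# name, row 2 holds the real column names (all values are made up)
raw <- data.frame(
  X1 = c("Goals", "Player", "Betts, Eddie", "Brown, Luke"),
  X2 = c("Goals", "R1", "4", "-"),
  X3 = c("Goals", "R2", "3", "1"),
  stringsAsFactors = FALSE
)

stat_name <- raw[1, 1]                          # name for the list element
clean <- raw[-c(1, 2), , drop = FALSE]          # drop both header rows
names(clean) <- as.character(unlist(raw[2, ]))  # promote row 2 to column names
rownames(clean) <- NULL                         # renumber rows from 1

result <- setNames(list(clean), stat_name)
str(result, 1)
```

The purrr version above does the same thing for every table in the list in one pass.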
You can now access any table from the website by name:
head(all_tables_res$Goals)
#> # A tibble: 6 x 27
#> Player R1 R2 R3 R4 R5 R6 R7 R8 R9
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Atkins, Rory 3 1 - 2 1 - 1 1 -
#> 2 Beech, Jonathon
#> 3 Betts, Eddie 4 3 3 6 3 1 3 2 3
#> 4 Brown, Luke - 1 - - 1 - - - -
#> 5 Cameron, Charlie 2 1 - 1 2 2 2 - 4
#> 6 Crouch, Brad - - - - 1
#> # ... with 17 more variables: R10 <chr>, R11 <chr>, R12 <chr>, R14 <chr>,
#> # R15 <chr>, R16 <chr>, R17 <chr>, R18 <chr>, R19 <chr>, R20 <chr>,
#> # R21 <chr>, R22 <chr>, R23 <chr>, QF <chr>, PF <chr>, GF <chr>,
#> # Tot <chr>
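Every column comes back as character, so calculations on individual columns need a conversion first. A possible sketch in base R on a toy table is shown here; the goals object, the to_number helper and the values are hypothetical, and treating both blanks and "-" as missing is an assumption (on this site "-" may instead mean zero).

```r
# Toy stand-in for one cleaned table; values are made up for illustration
goals <- data.frame(
  Player = c("Betts, Eddie", "Brown, Luke"),
  R1 = c("4", "-"),
  R2 = c("3", "1"),
  R3 = c("", "2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blanks and dashes become NA, the rest become numbers
to_number <- function(x) {
  x[x %in% c("", "-")] <- NA
  as.numeric(x)
}

round_cols <- setdiff(names(goals), "Player")
goals[round_cols] <- lapply(goals[round_cols], to_number)

# Per-player total over the rounds, ignoring missing games
goals$Total <- rowSums(goals[round_cols], na.rm = TRUE)
goals
```

Once the columns are numeric, expressions such as mean(goals$R2, na.rm = TRUE) work as expected.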