Scrape a table with rvest in R that has a mismatched table heading
Problem description
I'm trying to scrape this table which seems like it would be super simple. Here's the url of the table: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1
Here's what I coded:
url <- "https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1"
x = data.frame(read_html(url) %>%
  html_nodes("table") %>%
  html_table())
This works OK but gives really weird two-row headers, and when I try to add %>% slice(-1) to take out the top row, it says I can't because it's a list. I'd really like to figure out how to do this.
Here's one solution. An explanation follows.
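library(rvest)
library(tidyverse)

read_html(url) %>%
  html_nodes("table") %>%
  html_table(header = TRUE) %>%   # returns a list of data frames, one per table
  simplify() %>%
  first() %>%                     # take the first table as a data frame
  # paste the column names (top header row) onto row 1 (second header row)
  setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
  slice(-1)                       # drop the row that held the second header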
Output of glimpse():
Observations: 25
Variables: 16
$ Rank <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…
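Every column in that output is still character, with "-" wherever a stat is empty. If you want numbers afterwards, something along these lines might work. This is a minimal base R sketch and not part of the solution above: `stats` is a hand-built toy with a few values copied from the output, `dash_to_number` is a hypothetical helper, and treating "-" as missing (rather than zero) is an assumption.

```r
# Toy columns shaped like the glimpse() output (values copied from it).
stats <- data.frame(
  Player     = c("Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  PassingYds = c("324", "405"),
  PassingInt = c("-", "-"),
  RushingYds = c("6", "12"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat "-" as missing, then convert to numeric.
dash_to_number <- function(x) {
  x[x == "-"] <- NA
  as.numeric(x)
}

# Convert every column except the text column Player.
stat_cols <- setdiff(names(stats), "Player")
stats[stat_cols] <- lapply(stats[stat_cols], dash_to_number)

str(stats)
```

If a "-" should count as zero for your purposes, replace it with "0" instead of NA inside the helper.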
Explanation
From the ?html_table docs:
html_table currently makes a few assumptions:
- No cells span multiple rows
- Headers are in the first row
Part of your problem is solved by setting header = TRUE in html_table().
Another part of the problem is that the header cells span two rows, which html_table() does not expect.
Assuming you don't want to lose the information in either header row, you can:
- Use simplify and first to pull out the data frame from the list you get from html_table
- Use setNames to merge the two header rows (which are now the data frame columns and the first row)
- Remove the first row (now redundant) with slice
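The three steps above can be tried offline on a toy example. This is only a sketch under stated assumptions: `raw` is a hand-built stand-in for one table shaped the way this page's table comes back (top header row in the column names, second header row in row 1), and base R's `[[`, `names<-` and negative row indexing stand in for simplify/first, setNames and slice.

```r
# Toy stand-in for one scraped table: the column names hold the top header
# row and the first data row holds the second header row.
raw <- data.frame(
  c("Rank", "1", "2"),
  c("Player", "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  c("Yds", "324", "405"),
  c("TD", "5", "4"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Passing", "Passing")

# A node set with one table gives a list with one data frame in it.
tables <- list(raw)

# Step 1: pull the data frame out of the list.
df <- tables[[1]]

# Step 2: merge the two header rows into one set of column names.
second_header <- vapply(df, function(col) col[[1]], character(1))
names(df) <- paste0(names(df), second_header)

# Step 3: drop the first row, which is now redundant.
df <- df[-1, , drop = FALSE]
rownames(df) <- NULL

print(df)
```

The pull-out step is what makes slice(-1) usable in the original pipeline: slice expects a data frame, not the list that wraps it.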