Scrape a table with rvest in R that has mismatched table headings

Views: 33 | Published: 2021/7/14 18:41:22 | Tags: r, web-scraping, rvest

Problem description

I'm trying to scrape this table which seems like it would be super simple. Here's the url of the table: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1

Here's what I coded:

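library(rvest)  # also re-exports the %>% pipe

url <- "https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1"
# html_table() on a set of nodes returns a list of data frames, one per <table>
x = data.frame(read_html(url) %>% 
  html_nodes("table") %>% 
  html_table())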

This works OK but gives really weird two-row headers, and when I try to add %>% slice(-1) to take out the top row, it says I can't because it's a list. I'd really like to figure out how to do this.

Solution

Here's one solution. An explanation follows.

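library(rvest)
library(tidyverse)

# url is the address defined in the question
read_html(url) %>% 
  html_nodes("table") %>%  
  html_table(header = TRUE) %>%  # list of data frames, one per <table>
  simplify() %>% 
  first() %>%  # keep the first data frame in the list
  setNames(paste0(colnames(.), as.character(.[1,]))) %>%  # merge both header rows
  slice(-1)  # drop the row that is now part of the column names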

Output of glimpse():

Observations: 25
Variables: 16
$ Rank          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player        <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp           <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds    <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD     <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt    <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds    <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD     <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD   <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD     <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT       <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost       <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…
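
Every column in this output is <chr>, so the statistics need converting before you can do arithmetic on them. Here is a minimal base R sketch, assuming "-" means "no value" and should become NA (use 0 instead if that suits your analysis better). The two-row data frame is a hand-built stand-in for the scraped result, and to_number is a hypothetical helper, not part of rvest or dplyr.

```r
# Stand-in for a few columns of the scraped table (values copied from the output above)
scores <- data.frame(
  Player        = c("Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  PassingYds    = c("324", "405"),
  PassingInt    = c("-", "-"),
  FantasyPoints = c("33.56", "33.40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the "-" placeholder into NA, then parse as numeric
to_number <- function(x) {
  x[x == "-"] <- NA
  as.numeric(x)
}

# Convert every column except the player label
stat_cols <- setdiff(names(scores), "Player")
scores[stat_cols] <- lapply(scores[stat_cols], to_number)

str(scores)
```

Player stays character, while the stat columns become numeric with NA where the page showed "-".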

Explanation
From ?html_table docs:

html_table currently makes a few assumptions:

No cells span multiple rows

Headers are in the first row

Part of your problem is solved by setting header = TRUE in html_table().
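
To see what header = TRUE changes, here is a small sketch on a made-up inline table rather than the NFL page. The HTML and the names page, no_header and with_header are assumptions for illustration. The first row deliberately uses <td> cells, so it would not be auto-detected as a header.

```r
library(rvest)

# Hypothetical table: the header row is written with <td>, not <th>
page <- read_html('<table>
  <tr><td>Player</td><td>Yds</td></tr>
  <tr><td>A</td><td>324</td></tr>
  <tr><td>B</td><td>405</td></tr>
</table>')

tables_no_header <- page %>% html_nodes("table") %>% html_table(header = FALSE)
tables_header    <- page %>% html_nodes("table") %>% html_table(header = TRUE)

# html_table() on a node set returns a list, so take the first element
no_header   <- tables_no_header[[1]]
with_header <- tables_header[[1]]

names(no_header)    # generic names; the header text sits in row 1
names(with_header)  # names taken from the first row
```

With header = FALSE the header text ends up as an ordinary data row, which is why that version has one more row than the other.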

Another part of the problem is that the header cells span two rows, which html_table() does not expect.

Assuming you don't want to lose the information in either header row, you can:

Use simplify and first to pull out the data frame from the list you get from html_table
Use setNames to merge the two header rows (which are now the data frame columns and the first row)
Remove the first row (now redundant) with slice
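
The three steps above can also be sketched in base R without touching the network. This is only an illustration under stated assumptions: scraped is a hand-built stand-in for what html_table(header = TRUE) returns, and I am assuming the first header row comes through as the column names (blank over Player, "Passing" repeated across its sub-columns).

```r
# Stand-in for the scraped table: names hold header row 1, row 1 holds header row 2
scraped <- data.frame(
  a = c("Player", "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  b = c("Yds", "324", "405"),
  c = c("TD", "5", "4"),
  d = c("Points", "33.56", "33.40"),
  stringsAsFactors = FALSE
)
names(scraped) <- c("", "Passing", "Passing", "Fantasy")
tables <- list(scraped)  # html_table() on a node set gives a list like this

# Step 1: pull the data frame out of the list
tbl <- tables[[1]]

# Step 2: paste each column name onto the value in the first row
second_header <- vapply(tbl, function(col) col[1], character(1), USE.NAMES = FALSE)
names(tbl) <- paste0(names(tbl), second_header)

# Step 3: drop the first row, which is now redundant
clean <- tbl[-1, , drop = FALSE]
rownames(clean) <- NULL

names(clean)
```

Renaming before dropping the row matters here: the column names are unique by the time the row is removed, so duplicate names like "Passing" never get in the way.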
