Scrape a table with rvest in R that has mismatched table headings

Problem description

I'm trying to scrape this table which seems like it would be super simple. Here's the url of the table: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1

Here's what I coded:

url <- "https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1"
x = data.frame(read_html(url) %>% 
  html_nodes("table") %>% 
  html_table())

This works OK but gives really weird two-row headers, and when I try to add %>% slice(-1) to take out the top row, it says I can't because it's a list. I'd really like to figure out how to do this.
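For context on the "it's a list" complaint: html_table() applied to a set of table nodes returns a list with one element per table, not a single data frame. The sketch below shows this on a small inline HTML string, so it runs without network access. The HTML and its values are made up for illustration and are not the NFL page.

```r
library(rvest)

# Hypothetical one-table document, used only to show the return type
html <- "<table>
  <tr><th>Player</th><th>Pts</th></tr>
  <tr><td>A</td><td>1</td></tr>
</table>"

tables <- read_html(html) %>%
  html_nodes("table") %>%
  html_table()

is.list(tables)        # a list, one element per <table> node
is.data.frame(tables)  # FALSE: row-wise verbs such as slice() expect a data frame

# Extract the first (here, only) table to get something slice() can work with
first_table <- tables[[1]]
is.data.frame(first_table)
```

Pulling the element out with [[1]] (or an equivalent helper) is the step that has to happen before any row-wise operation.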

Solution

Here's one solution. An explanation follows.

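library(rvest)
library(tidyverse)

read_html(url) %>%
  html_nodes("table") %>%
  html_table(header = TRUE) %>%   # treat the first row as the header
  simplify() %>%
  first() %>%                     # pull the data frame out of the list
  # merge the two header rows: column names + first data row
  setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
  slice(-1)                       # drop the now-redundant first row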

Output of glimpse():

Observations: 25
Variables: 16
$ Rank          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player        <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp           <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds    <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD     <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt    <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds    <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD     <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD   <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD     <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT       <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost       <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…

Explanation
From ?html_table docs:

html_table currently makes a few assumptions:

  • No cells span multiple rows
  • Headers are in the first row

Part of your problem is solved by setting header = TRUE in html_table().

Another part of the problem is that the header cells span two rows, which html_table() does not expect.

Assuming you don't want to lose the information in either header row, you can:

  1. Use simplify and first to pull out the data frame from the list you get from html_table
  2. Use setNames to merge the two header rows (which are now the data frame columns and the first row)
  3. Remove the first row (now redundant) with slice
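The three steps above can be tried offline with a small stand-in for what html_table() returns. This is a minimal sketch in base R rather than the tidyverse pipeline from the answer. The object names (raw, tables, tbl) are hypothetical, and the stand-in assumes the first header row ended up as the column names and the second header row as the first data row. The sample values are taken from the glimpse() output above.

```r
# Stand-in for the list returned by html_table(header = TRUE):
# row 1 of each column holds the second header row.
raw <- data.frame(
  c("Rank", "1", "2"),
  c("Player", "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  c("Yds", "324", "405"),
  c("TD", "5", "4"),
  c("Yds", "6", "12"),
  stringsAsFactors = FALSE
)
# The first header row became the (partly empty, partly repeated) column names
names(raw) <- c("", "", "Passing", "Passing", "Rushing")
tables <- list(raw)

# Step 1: pull the data frame out of the list
tbl <- tables[[1]]

# Step 2: merge the two header rows (column names + first row values)
second_header <- vapply(tbl, function(col) col[1], character(1))
names(tbl) <- paste0(names(tbl), second_header)

# Step 3: drop the first row, which is now redundant
tbl <- tbl[-1, , drop = FALSE]
rownames(tbl) <- NULL

names(tbl)
```

After these steps the names are "Rank", "Player", "PassingYds", "PassingTD" and "RushingYds", matching the naming pattern in the glimpse() output.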
