How to get table using rvest()


Problem description


I want to grab some data from the Pro Football Reference website using the rvest package. First, let's grab the results for all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm

library("rvest")
library("dplyr")

#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html() 
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()

Is this how you would do it? :)


dat could be cleaned up a bit. Two of the variables seem to have blanks for names. Plus the header row is repeated between each week.

colnames(dat) <- c("week", "day", "date", "winner", "at", "loser", 
                   "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")

dat2 <- dat %>% filter(!(box == ""))
head(dat2)
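As a quick sanity check of that filter, here is a toy example (hypothetical values, mimicking how the repeated header rows have a blank in the `box` column):

```r
library(dplyr)

# Hypothetical miniature of dat: the repeated header row has box == ""
dat <- tibble::tibble(
  week = c("1", "1", "Week"),
  box  = c("boxscore", "boxscore", "")
)
dat2 <- dat %>% filter(!(box == ""))
nrow(dat2)  # 2
```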

Looks good!


Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: The Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.


I want to grab the individual snap counts for each player (about half way down the page). Pretty sure these will be our first two lines of code:

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()


But now I can't figure out how to grab the specific table I want. I use the Selector Gadget to highlight the table of Patriots snap counts. I do this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I end up with a path of:

#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left

Every attempt returns {xml_nodeset (0)}:

gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")


Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}

gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')


How can I grab that table? I'll also point out that when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out: they're shown in green instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first; its "View Page Source" is the usual red/black/blue color scheme. Is the green text a clue to what's preventing me from grabbing this snap count table?

Thanks!

Answer


The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.


Looking at the web page's source, however, the information for these tables is present in the code but hidden, because the tables are wrapped in HTML comments. Here is my solution, where I remove the comment markers and reprocess the page normally.
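This also explains the empty nodesets in the question: content inside an HTML comment is not an element node, so CSS and XPath element selectors cannot match it. A minimal, self-contained sketch (with a made-up HTML snippet, not the real page):

```r
library(rvest)
library(xml2)

# Hypothetical page: the table is hidden inside an HTML comment
page <- '<html><body><div id="all_snaps">
<!-- <table id="home_snap_counts"><tr><td>Tom Brady</td><td>59</td></tr></table> -->
</div></body></html>'

# The commented-out table is invisible to element selectors
read_html(page) %>% html_nodes("#home_snap_counts")  # {xml_nodeset (0)}

# Strip the comment markers and the table becomes a real element
visible <- gsub("<!--|-->", "", page)
read_html(visible) %>% html_nodes("#home_snap_counts")  # {xml_nodeset (1)}
```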


I save the body of the page to the working directory and then read the file back in with the readLines function.
Next I find the lines containing the HTML begin and end comment markers and remove them. I save the file a second time (minus the comment markers) so it can be re-read and processed for the selected tables.

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
gameHtml %>% html_nodes("tbody")

#Only save and work with the body
body <- html_node(gameHtml, "body")
write_xml(body, "nfl.xml")

#Find the lines with comment markers and remove them
lines <- readLines("nfl.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "nfl2.xml")

#Read the file back in and process normally
body <- read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])

#Extract the attributes of every table
a <- html_attrs(html_nodes(body, "table"))

#Find the tables of interest by their id attribute
homesnap <- which(sapply(a, function(x) x["id"]) == "home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])

visitsnap <- which(sapply(a, function(x) x["id"]) == "vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
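An alternative that avoids writing temporary files: xml2 can select the comment nodes directly with the XPath `//comment()` node test, and their text can then be re-parsed as HTML. A sketch on a made-up snippet (for the real page you would start from `gameHtml` instead of `doc`):

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the real boxscore page
doc <- read_html('<div><!-- <table id="home_snap_counts">
<tr><th>Player</th><th>Num</th></tr>
<tr><td>Rob Gronkowski</td><td>49</td></tr>
</table> --></div>')

# Pull every comment node, re-parse the combined text, and collect the tables
tables <- xml_find_all(doc, "//comment()") %>%
  xml_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table")

# Parse the first recovered table into a data frame
html_table(tables[[1]], header = TRUE)
```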

