Scraping basketball-reference.com in R (XML package not fully working)
Question
I have been scraping various pages of basketball-reference.com in R for a while now, using readHTMLTable from the XML package, without any issues, but now I have run into one. When I try to scrape the splits section of a player's page, it returns only the first row of the table rather than all of them.
For example:
URL="http://www.basketball-reference.com/players/j/jamesle01/splits/"
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
This gives me only one row in the table, the first one, but I want all the rows. I think the problem is that there are multiple headers in the table, but I'm not sure how to fix that.
Thanks
Recommended answer
You can filter on the table body:
library(XML)
appURL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
doc <- htmlParse(appURL)
appTables <- doc['//table/tbody']
appTables would be a list containing the 12 tables without headers. To retrieve the headers, you can get them from the thead:
myHeaders <- unlist(doc["//thead/tr[2]/th", fun = xmlValue])
myTables <- lapply(appTables, readHTMLTable, header = myHeaders)
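The two steps above (selecting each tbody, then reading the column names from the second thead row) can be tried offline on a small inline table. This is only a minimal sketch: the HTML string is a made-up miniature rather than the real page markup, the names html, bodies, headers and tables are illustrative, and it uses getNodeSet and xpathSApply in place of the doc[...] shorthand.

```r
library(XML)

# Hypothetical miniature of a table with two header rows and two <tbody>
# sections (an assumption for illustration, not the real page markup).
html <- paste0(
  "<table>",
  "<thead>",
  "<tr><th>Split</th><th colspan='2'>Totals</th></tr>",
  "<tr><th>Value</th><th>G</th><th>PTS</th></tr>",
  "</thead>",
  "<tbody>",
  "<tr><td>Total</td><td>871</td><td>23901</td></tr>",
  "</tbody>",
  "<tbody>",
  "<tr><td>Home</td><td>441</td><td>11774</td></tr>",
  "<tr><td>Road</td><td>430</td><td>12127</td></tr>",
  "</tbody>",
  "</table>"
)
doc <- htmlParse(html, asText = TRUE)

# One node per <tbody>; each is read as its own table below.
bodies <- getNodeSet(doc, "//table/tbody")

# Column names come from the second header row, as in the answer.
headers <- xpathSApply(doc, "//thead/tr[2]/th", xmlValue)

# Read every <tbody> separately and attach the shared header.
tables <- lapply(bodies, readHTMLTable, header = headers)

# Number of rows recovered from each <tbody>.
sapply(tables, nrow)
```

Reading each tbody separately is what recovers the rows beyond the first section; whether the real page still has this structure would need to be checked against its current HTML.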
You can put the data in one big table using something like:
bigTable <- do.call(rbind, myTables)
> head(bigTable)
Split Value G GS MP FG FGA 3P 3PA FT FTA ORB TRB AST STL BLK TOV PF PTS FG% 3P% FT%
1 Total 871 870 34364 8582 17289 1184 3462 5553 7432 1049 6239 6011 1483 698 2906 1615 23901 .496 .342 .747
2 Place Home 441 440 17167 4201 8307 567 1627 2805 3706 507 3133 3082 711 387 1413 744 11774 .506 .348 .757
3 Road 430 430 17197 4381 8982 617 1835 2748 3726 542 3106 2929 772 311 1493 871 12127 .488 .336 .738
4 All-Star Pre 569 568 22349 5544 11167 759 2205 3576 4791 655 4051 3966 967 456 1940 1087 15423 .496 .344 .746
5 Post 302 302 12015 3038 6122 425 1257 1977 2641 394 2188 2045 516 242 966 528 8478 .496 .338 .749
6 Result Win 572 571 22196 5783 11094 772 2154 3749 4931 677 4241 4132 1032 496 1793 1016 16087 .521 .358 .760
TS% USG% ORtg DRtg MP PTS TRB AST
1 .581 31.9 116 103 39.5 27.4 7.2 6.9
2 .592 30.9 118 102 38.9 26.7 7.1 7.0
3 .571 32.8 114 105 40.0 28.2 7.2 6.8
4 .581 31.7 116 103 39.3 27.1 7.1 7.0
5 .582 32.2 117 104 39.8 28.1 7.2 6.8
6 .606 31.7 122 99 38.8 28.1 7.4 7.2
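In the output above, MP, PTS, TRB and AST each appear twice (totals first, then per-game values), and the values are read as text rather than numbers. The sketch below shows one possible way to tidy that up with base R only. It is an assumption about what a reader might want next, not part of the original answer. The helper make_table and the trimmed-down column set are hypothetical, and the figures are copied from the first three rows above.

```r
# Header with repeated names: totals first, then per-game values.
hdr <- c("Value", "MP", "PTS", "MP", "PTS")

# make.unique() appends .1, .2, ... to repeated names.
hdr_unique <- make.unique(hdr)

# Hypothetical helper: build a character data frame from a list of rows,
# standing in for one table returned by readHTMLTable.
make_table <- function(rows) {
  df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(df) <- hdr_unique
  df
}

total <- make_table(list(c("Total", "34364", "23901", "39.5", "27.4")))
place <- make_table(list(
  c("Home", "17167", "11774", "38.9", "26.7"),
  c("Road", "17197", "12127", "40.0", "28.2")
))

# Stack the per-section tables, as in the answer.
big <- do.call(rbind, list(total, place))

# Convert every column except the label column to numeric.
num_cols <- setdiff(names(big), "Value")
big[num_cols] <- lapply(big[num_cols], as.numeric)

str(big)
```

With unique names, big$MP refers to the totals column and big$MP.1 to the per-game column, so selecting a column by name is no longer ambiguous.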