Scraping data from tables on multiple web pages in R (football players)


Question


I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.

http://www.sports-reference.com/cfb/players/ryan-aplin-1.html

I cannot find an aggregate of all players, so I need to go page by page and pull out the bottom row of each HTML table (Passing, Scoring, Rushing & Receiving, etc.).

Each player is categorized by their last name, with links for each letter of the alphabet here.

http://www.sports-reference.com/cfb/players/

For instance, each player with the last name A is found here.

http://www.sports-reference.com/cfb/players/a-index.html

This is my first time really getting into data scraping, so I tried to find similar questions with answers. The closest answer I found was this question.

I believe I could use something very similar, where I swap the page number for the collected player's name. However, I'm not sure how to change it to look for a player name instead of a page number.

Samuel L. Ventura also recently gave a talk about data scraping for NFL data, which can be found here.

EDIT:

Ben was really helpful and provided some great code. The first part works really well; however, when I attempt to run the second part I run into this:

> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it! 
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) : 
  unexpected '<' occurred on line 1
 use a command like
 x <- edit()
 to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
  deparse may be incomplete

This could be an error due to not having sufficient memory (only 4 GB on the computer I am currently using), although I do not understand this error:

    > all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

Looking through my other datasets, my players really only go back to 2007. If there were some way to pull just players from 2007 onwards, that might help shrink the data. If I had a list of the players whose names I wanted to pull, could I just replace the lnk in

 links[[i]] <- paste0("http://www.sports-reference.com", lnk)

with only the players that I need?
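
For example, something like this filter is what I have in mind (untested; wanted is a hypothetical vector of the URL name slugs I would supply):

# hypothetical: 'wanted' holds the URL slugs of just the players I need
wanted <- c("ryan-aplin-1", "neli-aasa-1")
# keep only the links whose final path component matches one of those slugs
links <- links[basename(links) %in% paste0(wanted, ".html")]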

Solution

Here's how you can easily get all the data in all the tables on all the player pages...

First make a list of the URLs for all the players' pages...

require(RCurl); require(XML)
n <- length(letters) 
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 10 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them from 
  # anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
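
A quick sanity check on the result (nothing assumed here beyond the links vector just built):

# confirm how many player URLs were collected and what they look like
length(links)
head(links, 3)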

Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:

Second, scrape all the tables at each URL to get their data, like so:

# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = (length(links)))
for(i in 1:length(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  ); if(class(result) == "try-error") next;
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
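
Given the memory error mentioned in the question's edit, a possible variant (just a sketch, not tested at full scale; the "scraped" directory name is only an example) writes each player's tables to disk as they arrive, instead of holding all the results in RAM:

# sketch: save each player's tables to disk as they are scraped,
# so all 67,000 results never have to sit in memory at once
dir.create("scraped", showWarnings = FALSE)
for(i in seq_along(links)){
  result <- try(readHTMLTable(links[i], stringsAsFactors = FALSE))
  if(inherits(result, "try-error")) next
  saveRDS(result, file.path("scraped", paste0(basename(links[i]), ".rds")))
}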

The result looks like this (this is just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
   Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0

$`neli-aasa`$kick_ret
   Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0

$`neli-aasa`$receiving
   Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

And we end up with a data frame that is ready for further analyses (also just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
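
If you only need seasons from 2007 onwards (as mentioned in the question), the combined table can be filtered and saved for later analysis; a sketch, assuming the scraped Year values look like "*2007" or "2000":

# strip non-digit characters from Year (values like "*2007") and keep 2007 onwards
yr <- suppressWarnings(as.numeric(gsub("[^0-9]", "", passing$Year)))
passing_recent <- passing[!is.na(yr) & yr >= 2007, ]
# write out for further analysis (file name is just an example)
write.csv(passing_recent, "passing_2007_onwards.csv")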
