Scraping a table from a section in Wikipedia

Question
I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.
Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.
The saving grace should be that the relevant table is always in a section with the word "Standings".
Is there some way I can grep a section name and only extract the table node(s) there?
Here are some sample pages to demonstrate the structure:
- 1922 season - Only one division, one table; the table is found under the heading "Standings" and has XPath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.
- 1950 season - Two divisions, two tables; both found under the heading "Final standings". The first has XPath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table; the second has XPath //*[@id="mw-content-text"]/div[3]/table and selector #mw-content-text > div:nth-child(21) > table.
- 2000 season - Two conferences, six divisions, two tables; both found under the heading "Final regular season standings". The first has XPath //*[@id="mw-content-text"]/div[2]/table and selector #mw-content-text > div:nth-child(16) > table; the second has XPath //*[@id="mw-content-text"]/div[3]/table and selector #mw-content-text > div:nth-child(17) > table.
In summary:
# season | xpath | css
-------------------------------------------------------------------------------------------------
# 1922 | //*[@id="mw-content-text"]/table[2] | #mw-content-text > table.wikitable
# 1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
# | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
# 2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
# | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table
Scraping, e.g., 1922 would be easy:
library(rvest)

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  whatever_else(...)
But I didn't see any pattern in the XPath nor the CSS selector that I could use to generalize this, so I don't have to write 80 individual scrapers.
Is there any robust way to try and scrape all these tables, especially given the crucial insight that all the tables are located below a heading which would return TRUE from grepl("standing", tolower(section_title))?
Answer
You can scrape everything at once by looping over the URLs with lapply and pulling the tables with a carefully chosen XPath selector:
library(rvest)

lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'),
       function(url){
         url %>% read_html() %>%
           html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>%
           html_table(fill = TRUE)
       })
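If you need to know which season each element of that result corresponds to, one small refinement (a sketch, assuming the same 1920:2015 URL scheme as above; no pages are fetched here) is to build the season vector up front and use it to name the list:

```r
# Build the season vector and URLs once (nothing is fetched here).
seasons <- 1920:2015
urls <- paste0('https://en.wikipedia.org/wiki/', seasons, '_NFL_season')

# After running the lapply() call above and saving it as, say,
# `standings`, naming the list by season allows lookups like
# standings[["1950"]]:
# names(standings) <- seasons

urls[1]   # the URL for the 1920 season
```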
The XPath selector looks for
- //span[contains(@id, "tandings")] - all spans with an id containing "tandings" (e.g. "Standings", "Final_standings"; matching on "tandings" covers both capitalizations of the first letter)
- /following::*[@title="Winning percentage" or text()="PCT"] - a node after that span in the HTML which has either a title attribute of "Winning percentage" or the text "PCT"
- /ancestor::table - and then, from that node, selects the table node up the tree.
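The selector's logic can be checked offline against a hand-written HTML fragment that mimics Wikipedia's heading/table structure (the fragment below is a made-up sketch, not a real Wikipedia page):

```r
library(rvest)

# A toy page: a "Final standings" section with a PCT table, plus an
# unrelated "Playoffs" section whose table should NOT match.
page <- minimal_html('
  <h2><span id="Final_standings">Final standings</span></h2>
  <table>
    <tr><th>Team</th><th><a title="Winning percentage">PCT</a></th></tr>
    <tr><td>Bears</td><td>.750</td></tr>
  </table>
  <h2><span id="Playoffs">Playoffs</span></h2>
  <table><tr><th>Round</th></tr><tr><td>Championship</td></tr></table>')

tables <- page %>%
  html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>%
  html_table(fill = TRUE)

length(tables)  # only the standings table is matched
```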