Scraping a table from a section in Wikipedia


Question

I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.

Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.

The saving grace should be that the relevant table is always in a section with the word "Standings".

Is there some way I can grep a section name and only extract the table node(s) there?

Here are some sample pages to demonstrate the structure:

  • 1922 season - Only one division, one table; table is found under heading "Standings" and has xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.

  • 1950 season - Two divisions, two tables; both found under heading "Final standings". First has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(21) > table.

  • 2000 season - Two conferences, six divisions, two tables; both found under heading "Final regular season standings". First has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(16) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(17) > table.

In summary:

# season |                                   xpath |                                          css
-------------------------------------------------------------------------------------------------
#   1922 |     //*[@id="mw-content-text"]/table[2] |           #mw-content-text > table.wikitable
#   1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
#   2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table

Scraping, e.g., 1922 would be easy:

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>% whatever_else(...)

But I don't see any pattern in the xpath or the CSS selector that I could use to generalize this, so I don't have to write 80 individual scraping exercises.

Is there any robust way to try and scrape all these tables, especially given the crucial insight that all the tables are located below a heading which would return TRUE from grepl("standing", tolower(section_title))?
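(For concreteness, a minimal sketch of the grepl idea, not part of the original question: on Wikipedia pages of this era, section headings carry a span.mw-headline element, so the candidate section titles can be listed and filtered directly.)

```r
# Sketch: list the section headings on one page and keep those that
# mention "standings" (headings sit in span.mw-headline on these pages).
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/1950_NFL_season")
section_titles <- html_text(html_nodes(page, "span.mw-headline"))
section_titles[grepl("standing", tolower(section_titles))]
```

The remaining problem, which the answer below solves, is getting from a matched heading to the table node(s) that follow it.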

Answer

You can scrape everything at once by looping the URLs with lapply and pulling the tables with a carefully chosen XPath selector:

library(rvest)

standings <- lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'), 
       function(url){ 
           url %>% read_html() %>% 
               # any table that sits after a "...tandings" heading and
               # contains a winning-percentage ("PCT") column
               html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
               html_table(fill = TRUE)
       })

The XPath selector looks for

  • //span[contains(@id, "tandings")]
    • all spans with an id containing tandings (e.g. "Standings", "Final standings")
  • /following::*[@title="Winning percentage" or text()="PCT"]
    • with a node after it in the HTML that has
      • either a title attribute of "Winning percentage"
      • or text of "PCT"
  • /ancestor::table
    • and then selects the table node(s) up the tree from that node.
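For convenience, the same call can be wrapped with setNames() so each element of the result is keyed by its season (a sketch building on the answer above, not part of the original; the variable names are illustrative):

```r
library(rvest)

# Same scrape as above, but with the results named by season so the
# output is easy to navigate and sanity-check.
seasons <- 1920:2015
by_season <- setNames(
    lapply(paste0('https://en.wikipedia.org/wiki/', seasons, '_NFL_season'),
           function(url){
               url %>% read_html() %>%
                   html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>%
                   html_table(fill = TRUE)
           }),
    seasons)

by_season[["1950"]]   # both division tables for the 1950 season
lengths(by_season)    # how many standings tables were matched per season
```

lengths() gives a quick check that at least one table was matched for every season, which is the easiest way to spot a page whose heading or table layout breaks the pattern.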
