Scraping a table from a section in Wikipedia


Question

I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.

Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.

The saving grace should be that the relevant table is always in a section with the word "Standings".

Is there some way I can grep a section name and only extract the table node(s) there?

Here are some sample pages to demonstrate the structure:

  • 1922 season - Only one division, one table; table is found under heading "Standings" and has xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.

  • 1950 season - Two divisions, two tables; both found under heading "Final standings". First has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(21) > table.

  • 2000 season - Two conferences, six divisions, two tables; both found under heading "Final regular season standings". First has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(16) > table, second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(17) > table.

In summary:

# season |                                   xpath |                                          css
-------------------------------------------------------------------------------------------------
#   1922 |     //*[@id="mw-content-text"]/table[2] |           #mw-content-text > table.wikitable
#   1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
#   2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table

Scraping, e.g., 1922 would be easy:

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>% whatever_else(...)

But I don't see any pattern in the xpath or the CSS selector that I could use to generalize this, so I don't have to write 80 individual scraping exercises.

Is there any robust way to try and scrape all these tables, especially given the crucial insight that all the tables are located below a heading which would return TRUE from grepl("standing", tolower(section_title))?
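(For concreteness, a minimal sketch of the grepl idea, not part of the original question: on Wikipedia pages of this era, section headings carry a span.mw-headline element, so the candidate section titles can be listed and filtered directly.)

```r
# Sketch: list the section headings on one page and keep those that
# mention "standings" (headings sit in span.mw-headline on these pages).
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/1950_NFL_season")
section_titles <- html_text(html_nodes(page, "span.mw-headline"))
section_titles[grepl("standing", tolower(section_titles))]
```

The remaining problem, which the answer below solves, is getting from a matched heading to the table node(s) that follow it.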

Answer

You can scrape everything at once by looping the URLs with lapply and pulling the tables with a carefully chosen XPath selector:

library(rvest)

standings <- lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'), 
       function(url){ 
           url %>% read_html() %>% 
               # any table that sits after a "...tandings" heading and
               # contains a winning-percentage ("PCT") column
               html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
               html_table(fill = TRUE)
       })

The XPath selector looks for

  • //span[contains(@id, "tandings")]
    • all spans with an id containing tandings (e.g. "Standings", "Final standings")
  • /following::*[@title="Winning percentage" or text()="PCT"]
    • with a node after it in the HTML that has
      • either a title attribute of "Winning percentage"
      • or text of "PCT"
  • /ancestor::table
    • and then selects the table node(s) up the tree from that node.
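For convenience, the same call can be wrapped with setNames() so each element of the result is keyed by its season (a sketch building on the answer above, not part of the original; the variable names are illustrative):

```r
library(rvest)

# Same scrape as above, but with the results named by season so the
# output is easy to navigate and sanity-check.
seasons <- 1920:2015
by_season <- setNames(
    lapply(paste0('https://en.wikipedia.org/wiki/', seasons, '_NFL_season'),
           function(url){
               url %>% read_html() %>%
                   html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>%
                   html_table(fill = TRUE)
           }),
    seasons)

by_season[["1950"]]   # both division tables for the 1950 season
lengths(by_season)    # how many standings tables were matched per season
```

lengths() gives a quick check that at least one table was matched for every season, which is the easiest way to spot a page whose heading or table layout breaks the pattern.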
