Scraping a complex HTML table into a data.frame in R

Views: 33 Published: 2021/7/14 18:33:57 Tags: r, rvest

Problem description

I am trying to load Wikipedia's data on US Supreme Court Justices into R:

library(rvest)

html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])

[1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"             
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
[5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

The problem is that the data is malformed. Instead of the name appearing as it does in the rendered HTML table ("James Wilson"), it appears twice: once as "Lastname, Firstname" and then again as "Firstname Lastname".

The reason is that each <td> actually contains an invisible <span>:

<td style="text-align:left;" class="">
    <span style="display:none" class="">Wilson, James</span>
    <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
</td>

The same is true for the columns with numeric data. I am guessing that this extra markup is needed for sorting the HTML table. However, I am unclear how to remove those spans when creating a data.frame from the table in R.
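The duplication can be reproduced offline. The sketch below is only an illustration: `snippet` is a made-up one-row table that copies the cell markup quoted above, and it assumes a version of rvest that provides `read_html()`.

```r
library(rvest)

# Hypothetical one-row table reusing the <td> markup quoted above.
snippet <- '<table>
  <tr><th>No.</th><th>Name</th></tr>
  <tr>
    <td>1</td>
    <td style="text-align:left;"><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a></td>
  </tr>
</table>'

page <- read_html(snippet)
cell <- html_nodes(page, "td")[[2]]

# html_text() concatenates the text of every descendant node. CSS such as
# display:none is never evaluated, so the hidden sort key is included.
html_text(cell)
# [1] "Wilson, JamesJames Wilson"

# The hidden span on its own:
html_text(html_nodes(cell, "span"))
# [1] "Wilson, James"
```

So the parser sees the sort key as ordinary text, and it has to be removed from the document before the table is converted.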

Recommended answer

Maybe something like this:

library(XML)
library(rvest)
# Note: html() and the XML functions below assume rvest 0.2.x, which parsed pages with the XML package.
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"
# [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

# Remove the hidden sort-key <span> from the second cell of every row, then re-read the table.
removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson"    "John Jay†"       "William Cushing" "John Blair, Jr."
# [5] "John Rutledge"   "James Iredell"
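The answer above relies on `XML::removeNodes()`, which works on documents parsed by the XML package. As far as I know, rvest 0.3.0 and later are built on xml2 and replace `html()` with `read_html()`, so there the same idea would be expressed with `xml2::xml_remove()`. The following is a sketch under those assumptions, not the original author's code. `snippet` is a made-up two-row table so the example runs offline, and the `display:none` style test in the XPath is an assumption about how the hidden spans are marked.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia table, small enough to run offline.
snippet <- '<table>
  <tr><th>No.</th><th>Name</th></tr>
  <tr><td>1</td><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td>2</td><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

page <- read_html(snippet)

# Find every hidden sort-key span inside a table cell and drop it in place.
hidden <- xml_find_all(page, "//table//td/span[contains(@style, 'display:none')]")
xml_remove(hidden)

# The document was modified in place, so html_table() now sees clean cells.
judges <- html_table(html_nodes(page, "table")[[1]])
judges[[2]]
# [1] "James Wilson" "John Jay"
```

Matching on the style attribute rather than on `td[2]` also covers the numeric columns mentioned in the question, as long as their sort keys are hidden the same way.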
