XML Package error in 2.12, but not 2.10

Question

I am using the XML package in R to read the HTML tables from a page. In 2.12.1, I am getting the following error:

Error in names(ans) = header : 
  'names' attribute [24] must be the same length as the vector [19]

However, when I run the same code snippet in 2.10, there are no errors and everything parses (almost) fine. I say almost because the column names are taken from the first row of the table, but I can get around that.

Here is my code:

## load the libraries
library(XML)

## set the season
SEASON <- "2011"

## create the URL
URL <- paste("http://www.hockey-reference.com/leagues/NHL_", SEASON, "_goalies.html", sep="")

## grab the page -- the table is parsed nicely -- why does this work in 2.10, but not in 2.12.1?
tables <- readHTMLTable(URL)

Any help you can provide will be much appreciated.

Recommended answer

I am not sure whether this problem occurs because of the move to v2.12.1 or not. I tried it on 2.12.1 and get the same error.

However, the error might also occur because something in the HTML changed. I had a look at the HTML source on that page, and the table isn't as well formed as one would hope. There are two problems with the HTML table: 1) the first header row contains merged columns, and 2) the header row gets repeated.
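Both problems can be seen in miniature with a made-up table. The sketch below is only an illustration: the HTML string and its three columns are invented, not taken from the hockey-reference page, and it assumes the XML package is installed.

```r
library(XML)

# Hypothetical miniature table: a merged first header row, a full header row,
# two data rows, and a repeated header row in between.
html <- '
<table>
  <tr><th colspan="2">Group A</th><th>Group B</th></tr>
  <tr><th>Rk</th><th>Player</th><th>GP</th></tr>
  <tr><td>1</td><td>Goalie A</td><td>40</td></tr>
  <tr><th>Rk</th><th>Player</th><th>GP</th></tr>
  <tr><td>2</td><td>Goalie B</td><td>35</td></tr>
</table>'

doc  <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table//tr")

# Text of every cell, row by row
cells <- lapply(rows, function(tr) xpathSApply(tr, "./th | ./td", xmlValue))

# Problem 1: the merged header row has fewer cells than the data rows
cell_counts <- vapply(cells, length, integer(1))

# Problem 2: the full header row appears more than once
header <- cells[[2]]
header_repeats <- sum(vapply(cells, function(x) identical(x, header), logical(1)))

print(cell_counts)
print(header_repeats)
```

Counting cells per row in this way is a quick check for whether a table has extra or merged header rows before handing it to a parser.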

It is the first of these that causes your code to return an error. The data rows are of length 19, but the header consists of two rows, one of length 19 and one of length 5, i.e. 24 in total. It is this discrepancy that throws your error.
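The length mismatch can be reproduced in base R without touching the page. This is a minimal sketch, not the internals of readHTMLTable(): the object names mirror the error message, and the header labels are placeholders.

```r
# A 19-element list stands in for one parsed data row (19 columns).
ans <- vector("list", 19)

# Placeholder header: 5 merged group labels followed by 19 column labels.
header <- c(paste0("group", 1:5), paste0("col", 1:19))

# Assigning 24 names to 19 elements fails; capture the message instead of stopping.
msg <- tryCatch({
  names(ans) <- header
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Using only the last 19 labels (the full header row) avoids the mismatch.
names(ans) <- tail(header, length(ans))
print(names(ans)[1:3])
```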

I haven't been able to scrape this page using the readHTMLTable() function. But here is my solution using the tools in scrapeR and XML:

# load the libraries
library(XML)
library(scrapeR)
library(plyr)
library(stringr)

# scrape and parse page
page <- scrape(url=URL, parse=TRUE)
raw <- xpathSApply(page[[1]], "//table//tr", xmlValue)
# split strings at each line break
rows <- strsplit(raw, "\n")
# now check for longest row length, and discard all short rows
rowlength <- (laply(rows, length))
rows <- rows[rowlength==max(rowlength)]
# unlist each row
rows <- laply(rows, function(x)unlist(x))
# trim white space
rows <- aaply(rows, c(1,2), str_trim)
# convert to data frame
df <- as.data.frame(rows, stringsAsFactors = FALSE)
# read names from first row
names(df) <- laply(df[1, ], str_trim)
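# remove all rows without a numeric rank (this drops the repeated header rows);
# as.numeric() warns about the non-numeric values, so suppress those warnings
df <- df[which(!is.na(suppressWarnings(as.numeric(df$Rk)))), ]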
df

The code is a little bit messy, and the table isn't clean, since all of the data are character vectors rather than numeric.

But at least this means you have the data in a format that you can process further.
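One possible next step is converting the character columns to numeric where that makes sense. The sketch below uses base R's type.convert() on a small made-up data frame. The column names and values are invented for illustration, and the real table may need extra cleaning (for example empty strings) before conversion.

```r
# Hypothetical stand-in for the scraped result: every column is character.
df <- data.frame(
  Rk     = c("1", "2"),
  Player = c("Goalie A", "Goalie B"),
  SV_pct = c(".921", ".915"),
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type it supports;
# as.is = TRUE keeps non-numeric columns as character instead of factor.
df[] <- lapply(df, function(col) type.convert(col, as.is = TRUE))

str(df)
```

Assigning into df[] keeps the result a data frame with the same column names.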
