Extracting an HTML table with rowspan values


Question

The data frame I create with the following code (using the RCurl and XML packages) puts the three-letter team abbreviation only into the first of the rows that it spans. Is there another package, or additional code I can add, that keeps the data in the proper columns?

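library(XML)    # provides readHTMLTable()
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Major_League_Baseball_postseason_teams"
# read the raw page source line by line
url_source <- readLines(url, encoding = "UTF-8")
# parse every table on the page and keep the second one
playoffs <- data.frame(readHTMLTable(url_source, stringsAsFactors = FALSE, header = TRUE)[2])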


Recommended answer

You are actually pretty close. The only thing left to do is to move the data into the proper columns and rows, because some of the rows have shifted to the left. You can achieve this as follows (with the help of the data.table and zoo packages):

# your original code
url <- "https://en.wikipedia.org/wiki/List_of_Major_League_Baseball_postseason_teams"
url_source <- readLines(url, encoding = "UTF-8")
playoffs <- data.frame(readHTMLTable(url_source, stringsAsFactors = FALSE, header = TRUE)[2])

# assigning proper names to the columns
names(playoffs) <- c("shortcode","franchise","years","appearances")

# 1. shift the data columnwise for the rows in which there is no shortcode
# 2. fill the resulting NA's with the last observation
# 3. only keep the last shortcode when the previous ones are the same
#    because only there the shortcode matches the franchise name
library(data.table)
library(zoo)
setDT(playoffs)[nchar(shortcode) > 3, `:=` (shortcode = NA,
                                            franchise = shortcode,
                                            years = franchise,
                                            appearances = years)
                ][, shortcode := na.locf(shortcode, na.rm = FALSE)
                  ][shortcode == shift(shortcode, 1L, type="lead"), shortcode := NA]
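
To see what steps 1 and 2 do without fetching the Wikipedia page, here is a minimal base-R sketch on a small made-up data frame. The team names, years and counts are invented for illustration, and the `cummax()` trick is only a stand-in for `zoo::na.locf()`; it assumes the first row has a shortcode. Step 3 is left out.

```r
# Toy data (invented) mimicking what readHTMLTable() returns when the first
# column uses rowspan: continuation rows lose their first cell and shift left.
toy <- data.frame(
  shortcode   = c("AAA", "Alpha City Aces", "BBB", "Beta Town Bears", "Beta Bay Bears"),
  franchise   = c("Alpha Aces", "1950-1960", "Beta Bears", "1901-1950", "1951-1999"),
  years       = c("1961-present", "2", "2000-present", "1", "4"),
  appearances = c("5", NA, "3", NA, NA),
  stringsAsFactors = FALSE
)

# Rows whose first cell is longer than three characters have no shortcode
shifted <- nchar(toy$shortcode) > 3

# Step 1: move the shifted rows one column to the right
# (the right-hand side is evaluated before anything is overwritten)
toy[shifted, c("franchise", "years", "appearances")] <-
  toy[shifted, c("shortcode", "franchise", "years")]
toy$shortcode[shifted] <- NA

# Step 2: carry the last seen shortcode forward.
# For each row, find the index of the most recent non-NA row and read from it.
last_seen <- cummax(seq_along(toy$shortcode) * !is.na(toy$shortcode))
toy$shortcode <- toy$shortcode[last_seen]

print(toy)
```

After these two steps every row carries its team's shortcode and the franchise, years and appearances values sit in their own columns again.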
