Extracting an HTML table into R

Question

I've been trying to extract a table from a webpage. The data is flight track data from a live flight tracking website (https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog).

I've tried the XML, RCurl and Curl packages, but none of them worked. Most likely this is because I couldn't figure out how to get past the SSL connection, or how to deal with the rows that contain notes on the flight status (i.e., the first two from the top and the third from the bottom of the table).

Does anyone know how to extract this table into R?

Recommended answer

As @hrbrmstr noted in the comments, scraping this page violates FlightAware's TOS, but what you do with your code is your business. :) Using the rvest package, this should get you most of the way there:

library(rvest)

u <- "https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog"

##  read_html() replaces html(), which is deprecated in current rvest:
html_read <- read_html(u)
tbl <- html_table(
  html_nodes(html_read, "table"), 
  fill=TRUE, 
  header=FALSE, 
  trim=TRUE 
)[[2]]

##  Keep only the data rows (they start at row 6) and drop
##    columns that are entirely NA:
tbl_o <- tbl[6:nrow(tbl), ]
tbl_o <- tbl_o[,colSums(is.na(tbl_o))!=nrow(tbl_o)]

names(tbl_o) <- c(
  "Time", "Lat", "Lon", 
  "Course", "Direction", 
  "KTS", "MPH", "Alt", 
  "Rate", "Location"
)

str(tbl_o)

Which yields:

'data.frame':   292 obs. of  10 variables:
 $ Time     : chr  "Fri 01:41:34 PM" "Fri 01:48:59 PM" "Fri 01:49:14 PM" "Fri 01:50:05 PM" ...
 $ Lat      : chr  "51.0833" "51.1551" "51.1683" "51.2235" ...
 $ Lon      : chr  "-113.9667" "-114.0209" "-114.0209" "-114.0220" ...
 $ Course   : chr  "335°" "0°" "0°" "358°" ...
 $ Direction: chr  "Northwest" "North" "North" "North" ...
 $ KTS      : chr  "20" "201" "219" "149" ...
 $ MPH      : chr  "23" "231" "252" "171" ...
 $ Alt      : chr  "3,500" "4,900" "5,200" "6,800" ...
 $ Rate     : chr  "" "222" "1,727" "1,701" ...
 $ Location : chr  "Edmonton Center" "FlightAware ADS-B  (CYYC)" "FlightAware ADS-B  (CYYC)" "FlightAware ADS-B  (CEG2)" ...
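
Every column in that result is still a character vector, so values such as "3,500" or "335°" cannot be used in arithmetic yet. The following is a minimal sketch of one way to convert them, not part of the original answer. The sample data frame is a hand-built copy of the first few values shown above, and to_number() is a helper name introduced here purely for illustration:

```r
# Hypothetical sample mirroring the first rows of the str() output above
# ("\u00b0" is the degree sign).
tbl_o <- data.frame(
  Time   = c("Fri 01:41:34 PM", "Fri 01:48:59 PM",
             "Fri 01:49:14 PM", "Fri 01:50:05 PM"),
  Lat    = c("51.0833", "51.1551", "51.1683", "51.2235"),
  Lon    = c("-113.9667", "-114.0209", "-114.0209", "-114.0220"),
  Course = c("335\u00b0", "0\u00b0", "0\u00b0", "358\u00b0"),
  KTS    = c("20", "201", "219", "149"),
  Alt    = c("3,500", "4,900", "5,200", "6,800"),
  Rate   = c("", "222", "1,727", "1,701"),
  stringsAsFactors = FALSE
)

# to_number() is an illustrative helper, not an rvest function.
# It drops every character except digits, the decimal point and the
# minus sign (so thousands separators and the degree sign go away),
# turns empty strings into NA, and converts the rest to numeric.
to_number <- function(x) {
  cleaned <- gsub("[^0-9.-]", "", x)
  cleaned[cleaned == ""] <- NA
  as.numeric(cleaned)
}

# Convert only the columns that hold numbers; Time stays as text.
num_cols <- c("Lat", "Lon", "Course", "KTS", "Alt", "Rate")
tbl_o[num_cols] <- lapply(tbl_o[num_cols], to_number)

str(tbl_o)
```

The Time column is left as text here because the scraped values carry only a weekday and a clock time; turning them into proper date-times would need the flight date from the URL, which this sketch does not attempt.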
