Scraping data off site using 4 urls for one day using R


Question


I am trying to scrape all the historical Air Pollution Index data from the Malaysian Department of Environment site that has the data split for all the stations into 4 hourly links per/day as below

http://apims.doe.gov.my/apims/hourly1.php?date=20130701
http://apims.doe.gov.my/apims/hourly2.php?date=20130701


Same as above with 'hourly3.php?' and 'hourly4.php?'


I am only a bit familiar with R so what would be the easiest way to do this using maybe the XML or scrapeR library?

Answer


You can turn all the tables into a wide data frame with list operations:

library(rvest)
library(magrittr)
library(dplyr)

date <- 20130701
rng <- 1:4  # the four hourly pages

my_tabs <- lapply(rng, function(i) {
  # build the URL for each of the four hourly pages and fetch it
  url <- sprintf("http://apims.doe.gov.my/apims/hourly%d.php?date=%s", i, date)
  pg <- read_html(url)  # read_html() replaces the now-deprecated html()
  # grab the first table on the page and parse it into a data frame
  pg %>% html_nodes("table") %>% extract2(1) %>% html_table(header=TRUE)
})

glimpse(plyr::join_all(my_tabs, by=colnames(my_tabs[[1]][1:2])))

## Observations: 52
## Variables:
## $ NEGERI / STATE   (chr) "Johor", "Johor", "Johor", "Johor", "Kedah...
## $ KAWASAN/AREA     (chr) "Kota Tinggi", "Larkin Lama", "Muar", "Pas...
## $ MASA/TIME12:00AM (chr) "63*", "53*", "51*", "55*", "37*", "48*", ...
## $ MASA/TIME01:00AM (chr) "62*", "52*", "52*", "55*", "36*", "48*", ...
## $ MASA/TIME02:00AM (chr) "61*", "51*", "53*", "55*", "35*", "48*", ...
## $ MASA/TIME03:00AM (chr) "60*", "50*", "54*", "55*", "35*", "48*", ...
## $ MASA/TIME04:00AM (chr) "59*", "49*", "54*", "54*", "34*", "47*", ...
## $ MASA/TIME05:00AM (chr) "58*", "48*", "54*", "54*", "34*", "45*", ...
## $ MASA/TIME06:00AM (chr) "57*", "47*", "53*", "53*", "33*", "45*", ...
## $ MASA/TIME07:00AM (chr) "57*", "46*", "52*", "53*", "32*", "45*", ...
## $ MASA/TIME08:00AM (chr) "56*", "45*", "52*", "52*", "32*", "44*", ...
## ...


I rarely actually load/use plyr anymore due to naming collisions with dplyr but join_all is perfect for this situation.
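If you want to avoid loading plyr at all, the same wide merge can be sketched in base R with `Reduce()` and `merge()` (this assumes, as above, that the first two columns of each table are the shared state/area keys):

```r
# base-R sketch of plyr::join_all: fold merge() over the list of tables,
# joining on the first two columns (state and area)
keys <- colnames(my_tabs[[1]])[1:2]
wide <- Reduce(function(x, y) merge(x, y, by=keys), my_tabs)
```

`merge()` defaults to an inner join, which matches `join_all`'s behavior here since every station appears in each hourly table.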


It's also likely you'll need this data in long format:

plyr::join_all(my_tabs, by=colnames(my_tabs[[1]][1:2])) %>%
  tidyr::gather(masa, nilai, -1, -2) %>%
  # better column names
  rename(negeri=`NEGERI / STATE`, kawasan=`KAWASAN/AREA`) %>%
  # cleanup & convert time (using local timezone)
  # make readings numeric; NA will sub for #
  mutate(masa=gsub("MASA/TIME", "", masa),
         masa=as.POSIXct(sprintf("%s %s", date, masa), format="%Y%m%d %H:%M%p", tz="Asia/Kuala_Lumpur"),
         nilai=as.numeric(gsub("[[:punct:]]+", "", nilai))) -> pollut

head(pollut)
##   negeri                 kawasan                masa nilai
## 1  Johor             Kota Tinggi 2013-07-01 12:00:00    63
## 2  Johor             Larkin Lama 2013-07-01 12:00:00    53
## 3  Johor                    Muar 2013-07-01 12:00:00    51
## 4  Johor            Pasir Gudang 2013-07-01 12:00:00    55
## 5  Kedah              Alor Setar 2013-07-01 12:00:00    37
## 6  Kedah Bakar Arang, Sg. Petani 2013-07-01 12:00:00    48
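Since the question asks for all the historical data, the one-day pipeline above can be wrapped in a loop over a date sequence and the results row-bound. A hedged sketch (the `scrape_day()` helper is hypothetical, standing in for the code above, and the date range is illustrative; the `Sys.sleep()` keeps the scrape polite):

```r
# illustrative only: scrape_day() is a hypothetical wrapper around the
# one-day pipeline above that returns the long-format data frame for a
# single yyyymmdd date string
dates <- format(seq(as.Date("2013-07-01"), as.Date("2013-07-07"), by="day"), "%Y%m%d")
all_pollut <- bind_rows(lapply(dates, function(d) {
  Sys.sleep(1)    # pause between requests to avoid hammering the server
  scrape_day(d)   # hypothetical helper built from the code above
}))
```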

