Referring to objects using variable strings in R

Problem description

Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation, so I'm a bit overwhelmed with the initial processing. The data I'm using is as follows (from WMO publicly available rainfall data):


120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)

There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".

I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique. Thanks again for the help!

(Original question:) I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more manageable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year. The following is a simplified version of my code so far:

a <- array(1,dim=c(10,12))
for (i in 1:5) {

  # all data:
  assign(paste("station_",i,sep=""), a)

  # march - june data:
  assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}

So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that it varies the value of i with each iteration. Any help much appreciated!
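Taken literally, the question asks how to look up an object whose name is built from a string. A minimal sketch in base R, assuming the same stand-in array `a` and the column range 4:7 exactly as in the question: `get()` is the counterpart of `assign()`, while a named list (the names `stations` and `mamj_tot` are illustrative, not from the original post) avoids constructing variable names at all.

```r
# Stand-in data from the question: 10 years x 12 columns, all ones
a <- array(1, dim = c(10, 12))

# Option 1: build the name as a string, then get()/assign() by that name
for (i in 1:5) {
  assign(paste0("station_", i, "_mamj"), a[, 4:7])   # column range as in the question
  mamj <- get(paste0("station_", i, "_mamj"))        # look the object up by name
  assign(paste0("station_", i, "_mamj_tot"), rowSums(mamj))
}

# Option 2: keep all stations in one named list and index it by name
stations <- setNames(replicate(5, a, simplify = FALSE), paste0("station_", 1:5))
mamj_tot <- lapply(stations, function(m) rowSums(m[, 4:7]))

# Any station can now be reached with a string, e.g. inside a loop over i
i <- 3
mamj_tot[[paste0("station_", i)]]
```

The list version is usually easier to extend, because `lapply()` or `sapply()` can process every station in one call instead of creating hundreds of separately named objects.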

Solution

This is totally begging for a dataframe, then it's just this one-liner with power-tools like ddply (amazingly powerful):

tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))

giving your aggregate of total for M/A/M/J, by year:

   year station_1 station_2 station_3 station_4 station_5 ...
1  1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2  1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3  1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4  1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...
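If plyr is not available, roughly the same per-year totals can be sketched with base R's `aggregate()`. This is an alternative to the `ddply` call, not the answer's own method, and the tiny two-station data frame below is made up purely for illustration.

```r
set.seed(1)  # only so the made-up numbers are reproducible

# Small stand-in for 'rain': one row per (year, month), two station columns
rain_small <- expand.grid(month = 1:12, year = 1970:1972)
station_cols <- c("station_1", "station_2")
rain_small$station_1 <- runif(nrow(rain_small))
rain_small$station_2 <- runif(nrow(rain_small))

# Keep March-June only, then sum every station column within each year
mamj <- rain_small[rain_small$month %in% 3:6, ]
tot_mamj_base <- aggregate(mamj[station_cols], by = list(year = mamj$year), FUN = sum)
tot_mamj_base
```

Because only the station columns are passed to `aggregate()`, the month column never gets summed, so no negative column index is needed here.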

Below is perfectly working code. We create a dataframe whose col.names are 'station_n'; also extra columns for year and month (factor, or else integer if you're lazy, see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):

library(plyr) # for ddply, colwise

# Parameterize everything here, it's crucial for testing/debugging
all_years <- 1970:2011
nYears <- length(all_years)
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste0('station_', 1:nStations)

# One row per (year, month). Each year must be repeated 12 times in a row;
# rep(all_years, 12) would pair every year with only two distinct months.
rain <- data.frame(
  year  = rep(all_years, each = 12),
  month = rep(1:12, times = nYears)
)
# Fill in NAs for all data
rain[, station_names] <- NA_real_
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj, station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol <- -2
tot_mamj <- ddply(rain[I.mamj, excludeMonthCol], 'year', colwise(sum))

# voila!!
# Sample rows from the original post; the data come from runif(), so the
# exact numbers and row labels will differ on every run.
#    year station_1 station_2 station_3 station_4 station_5
# 1  1972  8.618960  5.697739 10.083192  9.264512 11.152378
# 2  1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3  1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4  1975 16.773286 17.683704 18.259066 14.996550 19.007762
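The code above runs on made-up data. For the real file described in the question, one possible way to get from the raw text to per-year M/A/M/J totals is sketched below. It rests on assumptions not confirmed in the post: that every station header line starts with "120 ", that the twelve values after the year are January to December, and that 9999 marks a missing observation. The sample lines are copied from the question; with a real file they would come from `readLines()`.

```r
# Sample lines from the question; replace with readLines("your_file") for real data
lines <- c(
  "120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0",
  "1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03",
  "1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03",
  "120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0",
  "1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03"
)

# Assumption: header lines start with "120 "; data lines start with a 4-digit year
is_header <- grepl("^120 ", lines)

# Station number from each header, carried forward to the lines that follow it
header_ids <- sub("^120 +([0-9]+).*$", "\\1", lines[is_header])
station_id <- header_ids[cumsum(is_header)]

# Split each data line into 13 numbers: year plus 12 monthly values
fields <- strsplit(trimws(lines[!is_header]), " +")
vals <- do.call(rbind, lapply(fields, as.numeric))

months <- vals[, -1]
months[months == 9999] <- NA   # assumption: 9999 is a missing-value code

rain_real <- data.frame(station = station_id[!is_header], year = vals[, 1], months)
names(rain_real)[-(1:2)] <- month.abb   # assumption: columns are Jan..Dec

# March-June total per station and year; NA if any of the four months is missing
rain_real$mamj_tot <- rowSums(rain_real[, c("Mar", "Apr", "May", "Jun")])
rain_real[, c("station", "year", "mamj_tot")]
```

Keeping the station number as an ordinary column means all stations live in one data frame, so there is no need for one object per station.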

As a footnote, before I converted month from numeric to factor, it was getting silently 'aggregated' (until I put in the '-2': exclude column reference). However, better still is when you make it a factor, it will refuse point-blank to be aggregate'd, and throw an error (which is desirable for debugging):

 ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) : 
  sum not meaningful for factors
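The same behaviour can be reproduced without plyr. The short sketch below uses base R only, and the vector names are illustrative: a numeric month vector sums silently, while the factor version raises the error shown above.

```r
month_num <- c(3, 4, 5, 6)
month_fac <- factor(month_num)

# Numeric months add up without complaint, which is how they get 'aggregated' by accident
total_num <- sum(month_num)

# The factor refuses; capture the error message instead of stopping the script
msg <- tryCatch(sum(month_fac), error = function(e) conditionMessage(e))
print(msg)

# To get the original numbers back from a factor, go through character first;
# as.integer(month_fac) alone would return the level codes 1..4
month_back <- as.integer(as.character(month_fac))
```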
