expand.grid when one variable is really two columns
Question
I have a data set with districts, counties and years. If a given district/county combination occurs in any year, I want that combination to occur in every year. Below are two ways I have figured out to do this. The first approach uses a function to create combinations of district, county and year and requires only six lines of code. The second approach uses a combination of paste, expand.grid and strsplit and is much more convoluted.
There are probably much more efficient methods than either of the above. For example, is there a way to use expand.grid that achieves the district/county/year combinations, perhaps with only 1 or 2 lines of code?
Thank you for any advice. My function can do the job, but this question is a learning opportunity for me. I prefer base R.
Here is an example data set:
df.1 <- read.table(text = '
state district county year apples
AA EC A 1980 100
AA EC B 1980 10
AA EC C 1980 150
AA C G 1980 200
AA C other 1980 20
AA C I 1980 250
AA WC R 1980 300
AA WC S 1980 30
AA WC other 1980 350
AA EC A 1999 1100
AA EC D 1999 110
AA EC E 1999 1150
AA C H 1999 1200
AA C I 1999 120
AA C J 1999 1250
AA WC R 1999 1300
AA WC other 1999 130
AA WC T 1999 1350
', header=TRUE, stringsAsFactors = FALSE)
Here is the desired result:
desired.result <- read.table(text = '
state district county year apples
AA C G 1980 200
AA C H 1980 NA
AA C I 1980 250
AA C J 1980 NA
AA C other 1980 20
AA EC A 1980 100
AA EC B 1980 10
AA EC C 1980 150
AA EC D 1980 NA
AA EC E 1980 NA
AA WC other 1980 350
AA WC R 1980 300
AA WC S 1980 30
AA WC T 1980 NA
AA C G 1999 NA
AA C H 1999 1200
AA C I 1999 120
AA C J 1999 1250
AA C other 1999 NA
AA EC A 1999 1100
AA EC B 1999 NA
AA EC C 1999 NA
AA EC D 1999 110
AA EC E 1999 1150
AA WC other 1999 130
AA WC R 1999 1300
AA WC S 1999 NA
AA WC T 1999 1350
', header=TRUE, stringsAsFactors = FALSE)
Here is my most straightforward solution so far, which uses a function to represent every district/county combination for each year:
my.unique.function <- function(year) {
my.unique <- data.frame(unique(df.1[, c('state', 'district', 'county')]), year)
return(my.unique)
}
years <- as.data.frame(unique(df.1[, 'year']))
my.unique.output <- apply(years, 1, function(x) {my.unique.function(x)})
my.unique.output2 <- do.call(rbind.data.frame, my.unique.output)
desired.result2 <- merge(df.1, my.unique.output2, by = c('state', 'year', 'district', 'county'), all=TRUE)
# compare output with a priori desired result
desired.result <- desired.result[order(desired.result$state, desired.result$year, desired.result$district, desired.result$county),]
all.equal(desired.result[,c(1,4,2,3,5)], desired.result2[,1:5])
Here is my initial, much more complex solution:
# find unique combinations of district and county
my.unique <- unique(df.1[, c('district', 'county')])
# paste district and county together
my.unique$x <- apply( my.unique[ , c('district', 'county') ] , 1 , paste , collapse = "-" )
# represent each district/county combination for each year
expand.unique <- expand.grid(my.unique$x, year = c(1980, 1999))
expand.unique$Var1 <- as.character(expand.unique$Var1)
# split combined district/county combinations into two columns
expand.unique$Var1b <- sub('-', ' ', expand.unique$Var1)
unique.split <- strsplit(expand.unique$Var1b, ' ')
unique.splits <- matrix(unlist(unique.split), ncol=2, byrow=TRUE, dimnames = list(NULL, c("district", "county")))
# create template prior to merging with original data set
state <- 'AA'
desired.resultb <- data.frame(state, expand.unique, unique.splits)
# merge template with original data set so missing observations are present if a county is not included for a given year
desired.resultc <- merge(df.1, desired.resultb, by = c('state', 'year', 'district', 'county'), all=TRUE)
desired.resultc
# compare output with a priori desired result
desired.result <- desired.result[order(desired.result$state, desired.result$year, desired.result$district, desired.result$county),]
all.equal(desired.result[,c(1,4,2,3,5)], desired.resultc[,1:5])
Recommended answer
#find all (unique) state-district-county combinations
df.2 <- unique(df.1[,c("state","district","county")])
#find all (unique) years
df.3 <- unique(df.1[,"year",drop=FALSE])
#Cartesian product of combinations
df.4 <- merge(df.2,df.3)
#merge this with the original data frame,
#leaving NA's for unmatched parts in df.4
merge(df.1,df.4,all=TRUE)
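Since the question asks specifically about expand.grid, here is a minimal sketch of one way it could be used without pasting and splitting keys: expand over the row indices of the unique combinations, then look those rows back up. This is not part of the answer above. The small df.1 below is an illustrative stand-in with the same columns as the question's data, and the names combos, grid, template and result are made up for this example.

```r
# Illustrative stand-in for df.1 (same columns as the question's data set)
df.1 <- data.frame(
  state    = "AA",
  district = c("EC", "EC", "C", "EC", "C"),
  county   = c("A", "B", "G", "A", "H"),
  year     = c(1980, 1980, 1980, 1999, 1999),
  apples   = c(100, 10, 200, 1100, 1200),
  stringsAsFactors = FALSE
)

# Unique state/district/county rows; each row is treated as a single unit
combos <- unique(df.1[, c("state", "district", "county")])

# Cross the row indices of combos with the unique years
grid <- expand.grid(row = seq_len(nrow(combos)), year = unique(df.1$year))

# Look the combinations back up by index and attach the year
template <- data.frame(combos[grid$row, ], year = grid$year, row.names = NULL)

# Outer merge on the shared columns; apples is NA where a combination has no data
result <- merge(df.1, template, all = TRUE)
result
```

Indexing by row number keeps district and county as separate columns throughout, so the paste and strsplit steps are not needed. The merge-based Cartesian product in the answer above reaches the same template in one call.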