Sum of hybrid data frames depending on multiple conditions in R



This is a more complex follow-up to my previous question. The answer there was to use a matrix, but that doesn't work with data frames having values of different modes.

I want to combine data frames of different sizes, with character and integer columns, and calculate their sum depending on multiple conditions.

Conditions

  1. sums are only calculated for those rows that have a matching "Name"-value
  2. sums are calculated for matching column names only
  3. if a cell in df4 is not 0 and not NA, the sum should be df3 + df4
  4. else the sum should be df1 + df2 + df3

Example

> df1 <- data.frame(Name=c("Joe","Ann","Lee","Dan"), "1"=c(0,1,5,2), "2"=c(3,1,0,0), "3"=c(2,0,2,2), "4"=c(2,1,3,4))
> df1
  Name X1 X2 X3 X4
1  Joe  0  3  2  2
2  Ann  1  1  0  1
3  Lee  5  0  2  3
4  Dan  2  0  2  4

> df2 <- data.frame(Name=c("Joe","Ann","Ken"), "1"=c(3,4,1), "2"=c(2,3,0), "3"=c(2,4,3))
> df2
  Name X1 X2 X3
1  Joe  3  2  2
2  Ann  4  3  4
3  Ken  1  0  3

> df3 <- data.frame(Name=c("Lee","Ben"), "1"=c(1,3), "2"=c(3,4), "3"=c(4,3))
> df3
  Name X1 X2 X3
1  Lee  1  3  4
2  Ben  3  4  3

The condition depends on this frame:

> df4 <- data.frame(Name=c("Lee","Ann","Dan"), "1"=c(6,0,NA), "2"=c(0,0,4), "3"=c(0,NA,0))
> df4
   Name  X1  X2  X3
1   Lee   6   0   0
2   Ann   0   0  NA 
3   Dan  NA   4   0

With the above examples, this is the expected result (* values depend on df4):

> dfsum
  Name  X1  X2  X3  X4
1  Joe   3   5   4   2
2  Ann   5   4   4   1
3  Lee   7*  3   6   3
4  Dan   2   4*  2   4
5  Ken   1   0   3  NA
6  Ben   3   4   3  NA

Possible steps?

First expand df1, df2, df3, df4 to 5 columns and 6 rows, fill missing data with NA.

Then for each data frame:

  1. sort rows by "Name"
  2. separate "Name" column from "X1"..."X4"
  3. transform "X1"..."X4" columns to matrix
  4. calculate sums of the matrices like in the answer to my other question but with the additional condition 1
  5. transform result matrix to data frame
  6. cbind the "Name" column with the result data frame

How can this be done in R?
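
The steps listed above can be sketched in base R without extra packages. This is only a minimal sketch under stated assumptions, not the method from the answers: the helper names toMatrix, zeroNA and sumNA are hypothetical, and the columns are named X1..X4 directly, which is what data.frame() produces from "1".."4" as shown in the printed frames.

```r
df1 <- data.frame(Name = c("Joe", "Ann", "Lee", "Dan"),
                  X1 = c(0, 1, 5, 2), X2 = c(3, 1, 0, 0),
                  X3 = c(2, 0, 2, 2), X4 = c(2, 1, 3, 4))
df2 <- data.frame(Name = c("Joe", "Ann", "Ken"),
                  X1 = c(3, 4, 1), X2 = c(2, 3, 0), X3 = c(2, 4, 3))
df3 <- data.frame(Name = c("Lee", "Ben"),
                  X1 = c(1, 3), X2 = c(3, 4), X3 = c(4, 3))
df4 <- data.frame(Name = c("Lee", "Ann", "Dan"),
                  X1 = c(6, 0, NA), X2 = c(0, 0, 4), X3 = c(0, NA, 0))

dfs      <- list(df1, df2, df3, df4)
allNames <- unique(unlist(lapply(dfs, function(d) as.character(d$Name))))
allCols  <- setdiff(unique(unlist(lapply(dfs, names))), "Name")

# toMatrix (hypothetical helper): expand one data frame to a full
# Name x column matrix, with NA wherever the frame has no value
toMatrix <- function(d) {
  m <- matrix(NA_real_, nrow = length(allNames), ncol = length(allCols),
              dimnames = list(allNames, allCols))
  valueCols <- setdiff(names(d), "Name")
  m[as.character(d$Name), valueCols] <- as.matrix(d[valueCols])
  m
}

# zeroNA (hypothetical helper): treat missing cells as 0 for adding
zeroNA <- function(m) {
  m[is.na(m)] <- 0
  m
}

# sumNA (hypothetical helper): cell-wise sum that ignores NA,
# but stays NA where every input is NA
sumNA <- function(...) {
  ms <- list(...)
  total <- Reduce(`+`, lapply(ms, zeroNA))
  allMissing <- Reduce(`&`, lapply(ms, is.na))
  total[allMissing] <- NA
  total
}

m1 <- toMatrix(df1); m2 <- toMatrix(df2)
m3 <- toMatrix(df3); m4 <- toMatrix(df4)

base   <- sumNA(m1, m2, m3)        # condition 4: df1 + df2 + df3
alt    <- sumNA(m3, m4)            # condition 3: df3 + df4
useAlt <- !is.na(m4) & m4 != 0     # cells of df4 that are neither NA nor 0

result <- base
result[useAlt] <- alt[useAlt]

dfsum <- data.frame(Name = rownames(result), result, row.names = NULL)
dfsum
```

With the example data, dfsum reproduces the expected table from the question, including 7 for Lee/X1 and 4 for Dan/X2. Indexing the matrices by row and column names is what makes the sort step unnecessary in this sketch.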


Solution

@Ricardo Saporta's solution works with a few small changes:

Pass padValue=NA in each of the four addCols() calls, e.g. addCols(dt1, allColumnNames, padValue=NA).

As answered here, replace the definitions of sumD3D4 and dtsum with:

# sum that ignores NA, but returns NA when every value is NA
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)   # NA of the same type as x
  } else {
    sum(x, na.rm = TRUE)
  }
}

sumD3D4  <- setkey(rbind(dt3, dt4)[,lapply(.SD, plus), by = Name], "Name")
dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, plus), by=Name], "Name")
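
The reason for plus() rather than a plain sum(x, na.rm = TRUE) can be seen on small vectors. A minimal sketch in base R; the variable names are only for illustration:

```r
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)   # NA of the same type as x
  } else {
    sum(x, na.rm = TRUE)
  }
}

allMissing  <- plus(c(NA_real_, NA_real_))           # NA: nothing to add
someMissing <- plus(c(1, NA, 3))                     # 4: NA is skipped
plainSum    <- sum(c(NA_real_, NA_real_), na.rm = TRUE)  # 0, not NA
```

sum() with na.rm = TRUE returns 0 for an all-NA input, which would turn the padded X4 cells for Ken and Ben into 0 instead of the NA shown in the expected result.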

Solution

If you use data.table instead of data.frame, you can use its by= argument to sum the rows grouped by Name. The code below should give you your expected results.

Please note that I am padding the data.tables with extra empty columns. However, condTrue is computed before the padding is applied.

library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)

# make sure all dt's have the same columns 
#-----------------------------------------#

# identify which cells of dt4 satisfy the condition (neither NA nor 0)
condTrue <- as.data.table(which(!(is.na(dt4) | dt4==0), arr.ind=TRUE))

# ignore column "Name" from dt4
condTrue <- condTrue[col>1]

# convert from (row, col) index to ("Name", columnName) 
condTrue <- data.table(Name=dt4[condTrue$row, Name], colm=names(dt4)[condTrue$col], key="Name")


# First make a list of all the unique column names
allColumnNames <- unique(c(names(dt1), names(dt2), names(dt3), names(dt4)))

# add columns as necessary, using addCols (defined below; run that definition first)
addCols(dt1, allColumnNames)
addCols(dt2, allColumnNames)
addCols(dt3, allColumnNames)
addCols(dt4, allColumnNames)


sumD3D4  <- setkey(rbind(dt3, dt4)[, lapply(.SD, sum), by=Name], "Name")
dtsum    <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, sum), by=Name], "Name")

for (Nam in condTrue$Name) {
  colsRepl <- condTrue[.(Nam)]$colm
  valsRepl <- unlist(sumD3D4[.(Nam), c(colsRepl), with=FALSE])
  dtsum[.(Nam), c(colsRepl) :=  as.list(valsRepl)]
}

dtsum
#    Name 1 2 3 4
# 1:  Ann 5 4 4 1
# 2:  Ben 3 4 3 0
# 3:  Dan 2 4 2 4
# 4:  Joe 3 5 4 2
# 5:  Ken 1 0 3 0
# 6:  Lee 7 3 6 3


addCols <- function(x, cols, padValue=0)  {
  # adds to x any columns that are in cols but not in x
  # Returns TRUE  if columns were added
  #         FALSE if no columns added 
  colsMissing <- setdiff(cols, names(x))

  # grab the actual DT name that was passed to function
  dtName <- as.character(match.call()[2])

  if (length(colsMissing)) {
    get(dtName, envir=parent.frame(1))[, c(colsMissing) := padValue]  
    return(TRUE)
  }

  return(FALSE)
}
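
The condTrue step in the answer above relies on which(..., arr.ind = TRUE) returning (row, col) index pairs. A minimal base R sketch of just that step, assuming the columns are named X1..X3 and using a plain data.frame instead of a data.table:

```r
df4 <- data.frame(Name = c("Lee", "Ann", "Dan"),
                  X1 = c(6, 0, NA), X2 = c(0, 0, 4), X3 = c(0, NA, 0))

# numeric part only, so the Name column cannot match the condition
vals <- as.matrix(df4[-1])

# one row per cell that is neither NA nor 0, as (row, col) indices
hits <- which(!(is.na(vals) | vals == 0), arr.ind = TRUE)

# translate the indices into (Name, column name) pairs
condTrue <- data.frame(Name = as.character(df4$Name)[hits[, "row"]],
                       colm = colnames(vals)[hits[, "col"]],
                       stringsAsFactors = FALSE)
condTrue
```

For the example df4 this yields the pairs (Lee, X1) and (Dan, X2), which are exactly the starred cells in the expected result. Dropping the Name column before testing avoids the separate col > 1 filter used in the data.table version.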
