R conditional replace more columns by lookup


Problem description


Let's say we have many data columns in data frame df1 (those named in mycols, plus some others that should not be processed here) and a column subj, which is also an index into another data frame df2. df2 has columns repl and subj (subj is unique in df2) and many other unimportant columns (their only role here is that we cannot assume there are just two columns).

I would like to replace a subset of columns ( df1[,mycols] ) so that wherever there is an NA ( df1[,mycols][is.na(df1[,mycols])] ), it is replaced by the value of df2$repl from the row in df2 where df2$subj equals df1$subj.

EDIT: example data (I don't know the command to write it as a data frame assignment):

mycols = c("a","b")
df1:
subj a  b  c
1    NA NA 1
1    2  3  5
2    0  NA 2
3    8  8  8
df2:
subj repl notinterested
1     5    1000
2     6    0
3     40   10
result:
df1-transformed-to:
subj a  b  c
1    5  5  1      #the 2 fives appeared by lookup
1    2  3  5
2    0  6  2     #the 6 appeared
3    8  8  8

I came up with the following code:

df1[,mycols][is.na(df1[,mycols])] <- df2[match( df1$subj, df2$subj),"repl"] 

But the problem is (I think) that the right side is not the same size as the left side. I think it might work for a single column in mycols, but I want to do the same operation on all of mycols (if NA, look in table df2 and replace; the replacement value is the same across the whole row).

(Also, I need to enumerate the columns explicitly by name via mycols every time, because there might be other columns.)

As a bonus mini-question about programming style: what is a good and fast way to write this operation in R? In a procedural language, we could transform
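To make the suspected size mismatch concrete, here is a minimal sketch using the example data above (the names n_slots and rhs are only illustrative). It counts the NA cells on the left side and the looked-up values on the right side:

```r
# Example data from the question
df1 <- data.frame(subj = c(1L, 1L, 2L, 3L),
                  a = c(NA, 2L, 0L, 8L),
                  b = c(NA, 3L, NA, 8L),
                  c = c(1L, 5L, 2L, 8L))
df2 <- data.frame(subj = 1:3,
                  repl = c(5L, 6L, 40L),
                  notinterested = c(1000L, 0L, 10L))
mycols <- c("a", "b")

# Left side: one slot per NA cell across all of mycols
n_slots <- sum(is.na(df1[, mycols]))

# Right side: one looked-up value per row of df1
rhs <- df2[match(df1$subj, df2$subj), "repl"]

n_slots      # 3 NA cells to fill
length(rhs)  # 4 values, one per row
```

The lookup yields one value per row, while the assignment needs one value per NA cell, so the two sides only line up by accident.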

df1[,mycols][is.na(df1[,mycols])]

into an approach I consider nicer and more readable:

function(x){ *x[is.na(*x)] }
function(& df1[,mycols]) 

and be sure that nothing gets unnecessarily copied from place to place.

Solution

Using your code, we need to replicate the 'repl' column so that the two subsets are the same size, and then assign the values as you did:

 val <- df2$repl[match(df1$subj, df2$subj)][row(df1[mycols])][is.na(df1[mycols])]
 df1[mycols][is.na(df1[mycols])] <- val
 df1
 #  subj a b c
 #1    1 5 5 1
 #2    1 2 3 5
 #3    2 0 6 2
 #4    3 8 8 8
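The one-liner above can also be split into named steps and wrapped in a function, which touches on the style part of the question. This is only a sketch: fill_na_by_lookup and its arguments are hypothetical names, not something from the original answer. Because R uses copy-on-modify semantics, the function returns a modified copy rather than changing its argument in place.

```r
# Example data from the question
df1 <- data.frame(subj = c(1L, 1L, 2L, 3L),
                  a = c(NA, 2L, 0L, 8L),
                  b = c(NA, 3L, NA, 8L),
                  c = c(1L, 5L, 2L, 8L))
df2 <- data.frame(subj = 1:3,
                  repl = c(5L, 6L, 40L),
                  notinterested = c(1000L, 0L, 10L))
mycols <- c("a", "b")

# Hypothetical helper: fills NA cells in `cols` of `df` with the `value`
# column of `lookup`, matched on the `key` column.
fill_na_by_lookup <- function(df, cols, lookup, key, value) {
  # One replacement candidate per row of df
  per_row <- lookup[[value]][match(df[[key]], lookup[[key]])]
  # Logical matrix marking the NA cells in the selected columns
  na_cells <- is.na(df[cols])
  # row() gives each cell's row index, so per_row is repeated once per column
  per_cell <- per_row[row(na_cells)]
  # Keep only the values that line up with NA cells, then assign them
  df[cols][na_cells] <- per_cell[na_cells]
  df
}

out <- fill_na_by_lookup(df1, mycols, df2, key = "subj", value = "repl")
out
```

The caller decides whether to overwrite the original (df1 <- fill_na_by_lookup(...)) or keep both versions.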

Another option using data.table:

 library(data.table)#v1.9.5+
 DT <- setDT(df1, key='subj')[df2[c('subj', 'repl')]]
 for(j in mycols){
   i1 <- which(is.na(DT[[j]]))
   set(DT, i=i1, j=j, value= DT[['repl']][i1])
   }
 DT[,repl:= NULL]
 #   subj a b c
 #1:    1 5 5 1
 #2:    1 2 3 5
 #3:    2 0 6 2
 #4:    3 8 8 8
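A variant of the same idea, sketched under the assumption that a data.table version with the on= argument (1.9.6 or later) is available. Setting a key sorts the table by that key; an update join instead leaves the rows of df1 in their original order and works on a copy. The names DT and lookup are only illustrative.

```r
library(data.table)

# Example data from the question
df1 <- data.frame(subj = c(1L, 1L, 2L, 3L),
                  a = c(NA, 2L, 0L, 8L),
                  b = c(NA, 3L, NA, 8L),
                  c = c(1L, 5L, 2L, 8L))
df2 <- data.frame(subj = 1:3,
                  repl = c(5L, 6L, 40L),
                  notinterested = c(1000L, 0L, 10L))
mycols <- c("a", "b")

DT <- as.data.table(df1)      # a copy, so df1 itself is left untouched
lookup <- as.data.table(df2)

# Update join: add repl to DT by matching on subj
DT[lookup, on = "subj", repl := i.repl]

# Fill the NA cells column by column, by reference
for (j in mycols) {
  i1 <- which(is.na(DT[[j]]))
  set(DT, i = i1, j = j, value = DT[["repl"]][i1])
}

# Drop the helper column again
DT[, repl := NULL]
DT
```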

Or with dplyr:

 library(dplyr)
 left_join(df1, df2, by='subj') %>%
        mutate_each_(funs(ifelse(is.na(.),repl,.)), mycols) %>% 
        select(a:c)
 #  a b c
 #1 5 5 1
 #2 2 3 5
 #3 0 6 2
 #4 8 8 8
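mutate_each_() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, roughly the same pipeline could be written with across() and coalesce(). This is a sketch rather than the answerer's code, and unlike the version above it keeps the subj column in the result:

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(subj = c(1L, 1L, 2L, 3L),
                  a = c(NA, 2L, 0L, 8L),
                  b = c(NA, 3L, NA, 8L),
                  c = c(1L, 5L, 2L, 8L))
df2 <- data.frame(subj = 1:3,
                  repl = c(5L, 6L, 40L),
                  notinterested = c(1000L, 0L, 10L))
mycols <- c("a", "b")

res <- df1 %>%
  # Bring in only the lookup value, not the other columns of df2
  left_join(select(df2, subj, repl), by = "subj") %>%
  # For each column in mycols, take repl wherever the value is NA
  mutate(across(all_of(mycols), ~ coalesce(.x, repl))) %>%
  # Drop the helper column
  select(-repl)
res
```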

Data

 df1 <-  structure(list(subj = c(1L, 1L, 2L, 3L), a = c(NA, 2L, 0L, 8L 
 ), b = c(NA, 3L, NA, 8L), c = c(1L, 5L, 2L, 8L)), .Names = c("subj", 
 "a", "b", "c"), class = "data.frame", row.names = c(NA, -4L))

 df2 <- structure(list(subj = 1:3, repl = c(5L, 6L, 40L),
 notinterested = c(1000L, 
 0L, 10L)), .Names = c("subj", "repl", "notinterested"), 
 class = "data.frame", row.names = c(NA, -3L))
