Creating a function to replace NAs from one data.frame with values from another


Problem description

I regularly run into situations where I need to replace missing values in a data.frame with values from another data.frame that is at a different level of aggregation. For example, if I have a data.frame full of county data, I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times, I decided to break down and write a function to do this.

Here's what I cooked up, along with an example of how I use it:

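fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
  # merge() suffixes the clashing fill columns with ".x" (from naDf) and ".y" (from fillDf).
  # Note: merge() defaults to an inner join, so rows of naDf with no match in fillDf are dropped.
  mergedDf <- merge(naDf, fillDf, by=mergeCols)
  for (col in fillCols){
    colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
    colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
    k <- which( is.na( colWithNas ) )   # rows that need filling
    colWithNas[k] <- colWithOutNas[k]
    mergedDf[col] <- colWithNas
    # drop the suffixed helper columns
    mergedDf[[paste(col, "x", sep=".")]] <- NULL
    mergedDf[[paste(col, "y", sep=".")]] <- NULL
  }
  return(mergedDf)
}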

## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4), f = c(100,200,300,400), g = c(11,12,13,14))
## g has length 200, so data.frame() recycles a, b and f and naDf ends up with 200 rows
naDf <- data.frame( a = sample(c(1,2), 100, replace=TRUE), b = sample(c(3,4), 100, replace=TRUE), f = sample(c(0,NA), 100, replace=TRUE), g = sample(c(0,NA), 200, replace=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )

After I got this running, I had the odd feeling that someone has probably solved this problem before me, and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes that the columns we're filling from have the same names as the columns we're filling to, and the same applies to the merge columns.

Any guidance or refactoring would be helpful.

EDIT on Dec 2: I realized I had logic flaws in my example, which I have fixed.

Recommended answer

What a great question.

Here's a data.table solution:

# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))


# Merge data.tables, based on their keys (columns a & b)
# (recent data.table versions prefix the clashing fillDT columns with "i.",
#  giving i.f and i.g; older versions named them f.1 and g.1)
outDT <- naDT[fillDT]
#      a b  f  g i.f i.g
# [1,] 1 3 NA  0 100  11
# [2,] 1 3 NA NA 100  11
# [3,] 1 3 NA  0 100  11
# [4,] 1 3  0  0 100  11
# [5,] 1 3  0 NA 100  11
# First 5 rows of 200 printed.

# In outDT[i, j], on the following two lines
#   -- i is a Boolean vector indicating which rows will be operated on
#   -- j is an expression saying "(sub)assign from right column (e.g. i.f) to
#        left column (e.g. f)"
outDT[is.na(f), f:=i.f]
outDT[is.na(g), g:=i.g]

# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
#       a b  g   f
#  [1,] 1 3  0 100
#  [2,] 1 3 11 100
#  [3,] 1 3  0 100
#  [4,] 1 3  0   0
#  [5,] 1 3 11   0
# First 5 rows of 200 printed.
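
If you want this as a reusable function with the same arguments as fillNaDf, something along the following lines might work. It is only a sketch: the name fillNaDT is made up for illustration, it assumes the key combination in mergeCols is unique in fillDf, and it assumes each fill column has the same type in both inputs. Instead of merging and then cleaning up suffixed columns, it asks data.table for the matching row number in fillDf and fills the NA cells by reference. The loop over columns is still there, but it only does a cheap indexed assignment per column.

```r
library(data.table)

# Hypothetical helper, not part of the original answer.
# Fills NAs in `fillCols` of naDf with the values found in fillDf,
# matching rows on `mergeCols`. Assumes the mergeCols combination is
# unique in fillDf and that column names are the same in both inputs.
fillNaDT <- function(naDf, fillDf, mergeCols, fillCols) {
  out    <- as.data.table(naDf)    # works on a copy; naDf itself is left alone
  fillDT <- as.data.table(fillDf)

  # For every row of `out`, the row number of its match in fillDT
  # (NA when there is no match, so unmatched rows are kept, not dropped).
  idx <- fillDT[out, on = mergeCols, which = TRUE, mult = "first"]

  for (col in fillCols) {
    naRows <- which(is.na(out[[col]]))
    if (length(naRows)) {
      # Assign by reference only into the cells that are NA.
      set(out, i = naRows, j = col, value = fillDT[[col]][idx[naRows]])
    }
  }
  out[]
}
```

With the test data from the question, the call would be fillNaDT(naDf, fillDf, mergeCols = c("a", "b"), fillCols = c("f", "g")). Unlike the merge-based versions, the result keeps the row order and column layout of naDf, and a cell stays NA if its row has no match in fillDf.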
