Filling in missing data in one data frame with info from another

Question

There are two data sets, A and B, as shown below:

A <- data.frame(TICKER=c("00EY","00EY","00EY","00EY","00EY"), 
                CUSIP=c(NA,NA,"48205A10","48205A10","48205A10"), 
                OFTIC=c(NA,NA,"JUNO","JUNO","JUNO"), 
                CNAME=c(NA,NA, "JUNO", "JUNO","JUNO"), 
                ANNDATS=c("2015-01-13","2015-01-13","2015-01-13","2015-01-13","2015-01-13"),
                ANALYS=c(00076659,00105887,00153117,00148921,00086659),
                stringsAsFactors = FALSE)

B <- data.frame(TICKER=c("00EY","00EY","00EY","00EY"), 
                CUSIP=c("48205A10","48205A10","48205A10","48205A10"),
                OFTIC=c("JUNO","JUNO",NA,NA), 
                CNAME=c("JUNO","JUNO", NA, NA), 
                ANNDATS=c("2015-01-13","2015-01-13","2015-01-13","2015-01-13"), 
                ANALYS=c(00076659,00105887,00153117,00148921), 
                stringsAsFactors = FALSE)

How can I fill in missing data in one data frame with info from another? (A and B do not have the same number of rows.)

Recommended answer

Since the two data sets can have different lengths, you need some feature by which their rows can be matched. ANALYS appears to be some kind of identifier, so in this example we can use it to connect the two data.frames.
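As a quick illustration of linking rows by a key, base R's `match()` returns, for each ID in one vector, the position of its first occurrence in another vector. The sketch below uses the ANALYS values from the example; the variable names `a_ids`, `b_ids` and `idx` are illustrative only.

```r
# ANALYS values of A and B from the example
# (numeric literals, so the leading zeros are dropped)
a_ids <- c(76659, 105887, 153117, 148921, 86659)
b_ids <- c(76659, 105887, 153117, 148921)

# For each ID in A: the row of B holding the same ID, or NA if B has no such ID
idx <- match(a_ids, b_ids)
idx
# idx is 1, 2, 3, 4, NA: the fifth ID of A does not occur in B
```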

First, we identify all missing values in df1 (i.e. A) and get their row and column indices.
Then each missing value in df1 is replaced by the value from the row of df2 that has the same ANALYS value. If that ID does not occur in df2, the cell is skipped.

f <- function(df1, df2) {
  # Row/column positions of every NA in df1
  missingsInd <- which(is.na(df1), arr.ind = TRUE)

  # seq_len() avoids iterating over 1:0 when df1 has no NAs
  for (i in seq_len(nrow(missingsInd))) {
    row <- missingsInd[i, 1]
    col <- names(df1)[missingsInd[i, 2]]
    # Rows of df2 with the same ANALYS ID as the incomplete row of df1
    correspondingLine <- df2[which(df2$ANALYS == df1$ANALYS[row]), ]
    # Skip the cell if df2 has no row with this ID
    if (nrow(correspondingLine) != 0) {
      df1[row, col] <- correspondingLine[1, col]
    }
  }
  df1
}
f(A, B)
#   TICKER    CUSIP OFTIC CNAME    ANNDATS ANALYS
# 1   00EY 48205A10  JUNO  JUNO 2015-01-13  76659
# 2   00EY 48205A10  JUNO  JUNO 2015-01-13 105887
# 3   00EY 48205A10  JUNO  JUNO 2015-01-13 153117
# 4   00EY 48205A10  JUNO  JUNO 2015-01-13 148921
# 5   00EY 48205A10  JUNO  JUNO 2015-01-13  86659

Note that a cell that is NA in both data.frames remains NA in the result. Furthermore, this approach only works as intended if ANALYS holds unique values within each data.frame.
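The uniqueness assumption can be checked before filling, and the lookup can be done without an explicit loop over cells. The following is a minimal sketch, not the answer's own method: `fill_from` is a hypothetical helper name, the toy data frames `left` and `right` are made up for illustration, and the sketch assumes the key column and the columns to be filled have the same names in both data frames. It only checks uniqueness in the second data frame, since that is the one being looked up.

```r
# Hypothetical helper: fill NAs in df1 from df2, matching rows on a key column
fill_from <- function(df1, df2, key = "ANALYS") {
  # The lookup is only well defined if the key is unique in df2
  stopifnot(anyDuplicated(df2[[key]]) == 0)

  # Row of df2 matching each row of df1 (NA when the key is absent from df2)
  idx <- match(df1[[key]], df2[[key]])

  # Only columns present in both data frames can be filled
  for (col in setdiff(intersect(names(df1), names(df2)), key)) {
    fill <- is.na(df1[[col]]) & !is.na(idx)
    df1[[col]][fill] <- df2[[col]][idx[fill]]
  }
  df1
}

# Toy data, made up for illustration
left  <- data.frame(ANALYS = c(1, 2, 3), CUSIP = c(NA, "x", NA),
                    stringsAsFactors = FALSE)
right <- data.frame(ANALYS = c(1, 2), CUSIP = c("a", "b"),
                    stringsAsFactors = FALSE)
res <- fill_from(left, right)
# res$CUSIP is "a", "x", NA: row 3 stays NA because ID 3 is not in `right`
```

Existing values in the first data frame are never overwritten; only NA cells with a matching key are filled.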
