How to fill NA in R for quasi-same rows?

Problem description

I'm looking for a way to fill NAs in rows flagged by duplicated(). The rows are otherwise identical, but some of them have an NA in one column, so I want to fill each NA with the value from a complete row. I don't see how to do it.

After using the duplicated() function, I end up with a data frame like this:

 df <- data.frame(
   Year = rnorm(5), 
   hour = rnorm(5), 
   LOT = rnorm(5), 
   S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'), 
   S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'), 
   S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),  
   S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'), 
   S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
   S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
   S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
 )

The rows are similar, so how can I fill each NA with the value from the preceding row (which is not NA)? There are no complete.cases() rows.

Recommended answer

Reading your question made me think of an imputation problem for the data frame.

In other words, you need to fill the NAs with some value in order to "save" the records in the data frame. The simplest way is to pick a replacement value per column: the mean when the column is numeric, or the mode (the most frequent value) when it is categorical. You could also fit a regression to predict the missing values, but that is a more complex method.
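As a small illustration of the mean option for a numeric column, here is a minimal sketch. The vector x and its values are made up for the example and are not part of the original data.

```r
# Hypothetical numeric vector with one missing value
x <- c(2.5, NA, 3.5, 4.0)

# Replace every NA with the mean of the observed (non-NA) values
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)

print(x_filled)
```

Without na.rm = TRUE, mean() would itself return NA, so that argument is what makes the replacement work.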

In this case we can choose mode replacement, because the columns that contain NAs are categorical. Running your code gives the data frame df (the numeric columns come from rnorm(), so their values differ from run to run):

         Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
1 -0.32837526  0.7930541 -1.10954824 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
2  0.55379245 -0.7320060 -0.95088434 ABF4576    <NA> BF50343   AA3489 BQA0478  AD4352   AZERTY
3  0.36442118  0.9920967 -0.07345038 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
5  1.92550340 -1.0531371  0.88318695 ABF4576 ABF5403    <NA>   AA3489 BQA0478  AD4352   AZERTY

We can then create a function that calculates the mode of a particular column:

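getmode <- function(v) {
  # Drop NAs so that NA can never be returned as the mode
  v <- v[!is.na(v)]
  uniqv <- unique(v)
  # Most frequent value; ties go to the value that appears first
  uniqv[which.max(tabulate(match(v, uniqv)))]
}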

We then use it to fill the missing values. The code below imputes the missing values in column S135_AA (I created a new data frame named workdf so that df stays unchanged):

workdf <- df
workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])

This is the output, where you can see that the NA in column S135_AA has taken the most frequent value of that column (S13_BB still contains its NA, because only S135_AA was imputed):

         Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
1 -0.32837526  0.7930541 -1.10954824 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
2  0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
3  0.36442118  0.9920967 -0.07345038 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
5  1.92550340 -1.0531371  0.88318695 ABF4576 ABF5403    <NA>   AA3489 BQA0478  AD4352   AZERTY
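The same replacement can be extended to every column that contains an NA, so that S13_BB is filled as well. The following is only a sketch under stated assumptions: the helper names mode_of and fill_with_mode are hypothetical, and the toy data frame is a reduced stand-in for df. For numeric columns the mean is usually a more sensible replacement than the mode.

```r
# Hypothetical helper: most frequent non-NA value of a vector
mode_of <- function(v) {
  v <- v[!is.na(v)]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical helper: fill the NAs of every column with that column's mode
fill_with_mode <- function(data) {
  for (col in names(data)) {
    missing <- is.na(data[[col]])
    # Skip columns without NAs and columns that are entirely NA
    if (any(missing) && !all(missing)) {
      data[[col]][missing] <- mode_of(data[[col]])
    }
  }
  data
}

# Reduced stand-in for the data frame in the question
toy <- data.frame(
  S135_AA = c("ABF5403", NA, "ABF5403"),
  S13_BB  = c("BF50343", "BF50343", NA),
  stringsAsFactors = FALSE
)

filled <- fill_with_mode(toy)
print(filled)
```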

If your objective is data cleaning, I would suggest handling the NAs with an imputation method such as this one.
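If the goal is literally to copy the value from the preceding row, as the question asks, a "last observation carried forward" fill is an alternative to mode imputation. The sketch below uses base R only; fill_down and fill_both are hypothetical helper names, and the toy data frame is made up for illustration. It assumes the rows are already ordered so that neighbouring rows are the near-duplicates.

```r
# Hypothetical helper: replace each NA with the last non-NA value above it
fill_down <- function(v) {
  # For each position, index of the latest non-NA element seen so far (0 if none)
  idx <- cummax(seq_along(v) * !is.na(v))
  idx[idx == 0] <- NA  # leading NAs have nothing above them and stay NA
  v[idx]
}

# Hypothetical helper: fill downwards, then upwards to cover leading NAs
fill_both <- function(v) {
  rev(fill_down(rev(fill_down(v))))
}

# Toy stand-in for two of the columns in the question
toy <- data.frame(
  S135_AA = c("ABF5403", NA, "ABF5403"),
  S13_BB  = c(NA, "BF50343", NA),
  stringsAsFactors = FALSE
)

# Assigning into toy[] keeps the data frame structure
toy[] <- lapply(toy, fill_both)
print(toy)
```

Unlike the mode approach, this depends on row order, so it only makes sense when adjacent rows really do describe the same record.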
