Interpolate NA values in a data frame with na.approx


Question

I am trying to remove NAs from my data frame by interpolating with na.approx(), but it does not remove all of the NAs.

My data frame is 4096x4096, with 270.15 as the flag for invalid values. I need the data to be continuous at all points to feed a meteorological model. Yesterday I asked, and obtained an answer, on how to replace values in a data frame based on another data frame. After that I came across na.approx(), so I decided to replace the 270.15 values with NA and try na.approx() to interpolate the data. The question is why na.approx() does not replace all of the NAs.

This is what I am doing:

  • Read the original hdf file with hdf5load
  • Subset the data frame (4094x4096)
  • Substitute the flag value with NA

    > sst4[sst4 == 270.15] <- NA


  • Check first column (or any other)

    > summary(sst4[,1])
    
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.4   285.9   285.5   292.3   302.8  1345.0
    


  • Run na.approx

    > sst4 <- na.approx(sst4, na.rm = FALSE)
    


  • Check first column

    > summary(sst4[,1]) 
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.5   286.3   285.9   292.6   302.8   411.0
    


  • As you can see, 411 NAs have not been removed. Why? Do they all correspond to leading/trailing column values?

    head(sst4[,1])
    [1] NA NA NA NA NA NA
    tail(sst4[,1])
    [1] NA NA NA NA NA NA
    

    Does na.approx need valid values before and after an NA in order to interpolate? Do I need to set any other na.approx option?

    Many thanks

    Recommended answer

    A small reproducible example:

    library(zoo)
    set.seed(1)
    m <- matrix(runif(16, 0, 100), nrow = 4)
    missing_values <- sample(16, 7)
    m[missing_values] <- NA
    m
             [,1]     [,2]      [,3]     [,4]
    [1,] 26.55087 20.16819 62.911404 68.70228
    [2,] 37.21239       NA  6.178627 38.41037
    [3,]       NA       NA        NA       NA
    [4,] 90.82078 66.07978        NA       NA
    
    na.approx(m, na.rm = FALSE)  # na.rm = FALSE keeps rows with leading/trailing NAs
             [,1]     [,2]      [,3]     [,4]
    [1,] 26.55087 20.16819 62.911404 68.70228
    [2,] 37.21239 35.47206  6.178627 38.41037
    [3,] 64.01658 50.77592        NA       NA
    [4,] 90.82078 66.07978        NA       NA
    
    m[4, 4] <- 50
    na.approx(m, na.rm = FALSE)
             [,1]     [,2]      [,3]     [,4]
    [1,] 26.55087 20.16819 62.911404 68.70228
    [2,] 37.21239 35.47206  6.178627 38.41037
    [3,] 64.01658 50.77592        NA 44.20519
    [4,] 90.82078 66.07978        NA 50.00000
    

    Yup, it looks like the start/end values of each column need to be known, or the interpolation doesn't fill them in. Can you guess values for your boundaries?

    ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However, it is possible to get na.approx to always fill in the blanks by passing rule = 2. See Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
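As a minimal sketch of the two options just mentioned, here they are on a small made-up vector (`x` is hypothetical, not the asker's data). It assumes `rule = 2` is forwarded by na.approx to `stats::approx`, which makes leading and trailing NAs take the nearest known value, and that a scalar `fill` in na.fill replaces every remaining NA.

```r
library(zoo)

# Hypothetical data with a leading, an interior, and a trailing NA
x <- c(NA, 1, NA, 3, NA)

# Default interpolation, keeping the vector length: only the interior NA is filled
interior_only <- na.approx(x, na.rm = FALSE)

# rule = 2: values outside the known range are set to the nearest known value
extended <- na.approx(x, rule = 2)

# Alternative: interpolate first, then give the remaining edge NAs a default value
filled <- na.fill(na.approx(x, na.rm = FALSE), 0)
```

Whether extending the nearest value or substituting a fixed default is more appropriate depends on what the boundary of the grid represents physically.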

    EDIT: A further thought. Since na.approx only interpolates within columns, and your data is spatial, perhaps interpolating within rows would be useful too. Then you could take the average.

    na.approx fails when whole columns are NA, so we create a bigger dataset.

    set.seed(1)
    m <- matrix(runif(64, 0, 100), nrow = 8)
    missing_values <- sample(64, 15)
    m[missing_values] <- NA
    

    Run na.approx both ways.

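    # na.rm = FALSE keeps leading/trailing NAs so both results keep the dimensions of m
    by_col <- na.approx(m, na.rm = FALSE)
    by_row <- t(na.approx(t(m), na.rm = FALSE))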
    

    Find the best guess.

    default <- 50
    best_guess <- ifelse(is.na(by_row), 
      ifelse(
        is.na(by_col), 
        default,              #neither known
        by_col                #only by_col known
      ), 
      ifelse(
        is.na(by_col), 
        by_row,               #only by_row known
        (by_row + by_col) / 2 #both known
      )
    )
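The nested ifelse above can also be written more compactly. The following is only a sketch of one equivalent formulation, using a small hand-made matrix (hypothetical values, chosen so that every row and column keeps at least two known points): stack the two interpolations into an array, average over the third dimension while ignoring NAs, and fall back to the default where neither direction produced a value.

```r
library(zoo)

# Hypothetical 4x4 grid with m[i, j] = 10 * (i + j), then three values knocked out
m <- outer(1:4, 1:4, "+") * 10
m[1, 1] <- NA   # corner: unknown in both directions
m[2, 2] <- NA   # interior in both its row and its column
m[2, 4] <- NA   # interior in its column, trailing in its row

by_col <- na.approx(m, na.rm = FALSE)
by_row <- t(na.approx(t(m), na.rm = FALSE))

# Stack both estimates and average cell by cell, dropping NAs
stacked <- array(c(by_col, by_row), dim = c(dim(m), 2))
best_guess <- apply(stacked, c(1, 2), mean, na.rm = TRUE)

# Cells where both estimates were NA come out as NaN; give them the default
default <- 50
best_guess[is.nan(best_guess)] <- default
```

Here `best_guess[2, 2]` is the average of the row and column estimates, `best_guess[2, 4]` comes from the column estimate alone, and `best_guess[1, 1]` falls back to `default`.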
    
