Replace values in one column based on a vector conditionally matching another column

Question

I have the following data frame, and I want to replace the reflectance values with NA whenever the wavelength value falls within one of several ranges that were determined to be bad measurements (the badData vector).

The ranges of bad data might change over time, so I would like the solution to be as general as possible.

  badData <- c(296:310, 330:335, 350:565)

  df <- data.frame(wavelength = seq(300,360,5.008667),
                  reflectance = seq(-1,-61,-5.008667))

df 

   wavelength reflectance
   300.0000   -1.000000
   305.0087   -6.008667
   310.0173  -11.017334
   315.0260  -16.026001
   320.0347  -21.034668
   325.0433  -26.043335
   330.0520  -31.052002
   335.0607  -36.060669
   340.0693  -41.069336
   345.0780  -46.078003
   350.0867  -51.086670
   355.0953  -56.095337

I tried:

library(dplyr)  # provides %>% and mutate()
Data2 <- df %>%
  mutate(reflectance = replace(reflectance, wavelength %in% badData, NA))

But because I am trying to do this with wavelength ranges rather than exact values, this does not work. I think I should use a conditional statement, but I do not know how to feed a vector with different groupings of ranges through it most efficiently.
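As a small illustration of why the attempt above falls short, the sketch below applies %in% to the same wavelengths. It assumes nothing beyond base R; the name hits is introduced here only for illustration.

```r
# badData as originally defined: a vector of integers
badData <- c(296:310, 330:335, 350:565)
wavelength <- seq(300, 360, 5.008667)

# %in% tests for exact equality with an element of badData,
# so a non-integer wavelength such as 305.0087 can never match
hits <- wavelength %in% badData
wavelength[hits]  # only the exact integer 300 matches
```

Every other wavelength lies between two integers, so it is never flagged even when it sits inside a bad range.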

The desired output is shown below: wavelengths 300.0000 and 305.0087 fall between 296 and 310, wavelength 330.0520 falls between 330 and 335, and 350.0867 and 355.0953 fall within 350:565.

 wavelength reflectance
   300.0000   NA
   305.0087   NA
   310.0173  -11.017334
   315.0260  -16.026001
   320.0347  -21.034668
   325.0433  -26.043335
   330.0520  NA
   335.0607  -36.060669
   340.0693  -41.069336
   345.0780  -46.078003
   350.0867  NA
   355.0953  NA

Recommended answer

The first step is to realize that defining ranges of integers will not work. Instead, I'll go with a list of number pairs:

badData <- list(c(296,310), c(330,335), c(350,565))

with the understanding that we want to check whether each value of df$wavelength falls within any of these three ranges. More ranges can be added to the list.
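If the bad ranges still arrive as a vector of integers, as in the question, the pairs could be derived from it rather than typed by hand. This is a minimal sketch, assuming the integers are sorted and that each run of consecutive integers represents one range; runs_to_ranges is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: collapse runs of consecutive integers into c(min, max) pairs
runs_to_ranges <- function(v) {
  # a new group starts wherever the step between neighbours is not 1
  group <- cumsum(c(TRUE, diff(v) != 1))
  unname(lapply(split(v, group), range))
}

badInts <- c(296:310, 330:335, 350:565)
badData <- runs_to_ranges(badInts)
str(badData)
```

Each element of the result is a two-element vector holding the lower and upper limit of one run.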

The second thing we can do is write a function that checks whether each value in a vector falls within one or more pairs of numbers. (In this example, we "know" that a value will not be in more than one range, but that's not critical.)

within_ranges <- function(x, lims)  {
  Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]))
}

To understand what this is doing, let's debug it, call it, and see what's going on.

debugonce(within_ranges)
within_ranges(df$wavelength, badData)
# debugging in: within_ranges(df$wavelength, badData)
# debug at #1: {
#     Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= 
#         lim[2]))
# }

Let's run the inner part:

# Browse[2]> 
lapply(lims, function(lim) lim[1] <= x & x <= lim[2])
# [[1]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [[2]]
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [[3]]
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

So the first element (T,T,F,F,...) shows whether the values (x) fall within the first number pair (296 to 310); the second element does the same for the second pair (330 to 335); and so on.

The Reduce( part calls its first argument, a function, on the first two elements of the list, saves the return value, and then runs the same function on that return value and the third element. It stores the result, then runs the same function on it and the fourth element (if one exists). It repeats this along the entire length of the provided list.
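The folding behaviour described here can be sketched as an explicit loop. The short logical vectors in masks are made-up illustration data, not values from the question.

```r
# Made-up masks standing in for the per-range results
masks <- list(c(TRUE, FALSE, FALSE, FALSE),
              c(FALSE, FALSE, TRUE, FALSE),
              c(FALSE, FALSE, FALSE, TRUE))

# Start with the first element, then fold each remaining element in with `|`
acc <- masks[[1]]
for (m in masks[-1]) {
  acc <- acc | m
}

# The loop and Reduce() produce the same combined mask
identical(acc, Reduce(`|`, masks))
```

Reduce expresses the same accumulation without the loop variable, which is why it fits a list of any length.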

In this example, the function is the literal | (wrapped in backticks since it is a special operator), so it is "OR"ing the [[1]] vector with the [[2]] vector. You can actually see what is happening if you add accumulate=TRUE:

# Browse[2]> 
Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]), accumulate=TRUE)
# [[1]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [[2]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [[3]]
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

The first return is the first vector, unmodified. The second element is the original [[2]] vector ORed with the previous return, which is this [[1]] vector (the same as the original [[1]]). The third element is the original [[3]] vector ORed with the previous return, which is this [[2]]. This results in the three groupings of TRUE (positions 1 and 2, 7, and 11 and 12) that you are expecting. So we want the [[3]] element, which is what we get without accumulating:

# Browse[2]> 
Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]))
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

Okay, so let's Quit out of the debugger, and give it a full go:

within_ranges(df$wavelength, badData)
#  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

This output should look familiar.

(BTW: inside our function, we could also have used

rowSums(sapply(lims, ...)) > 0

and it would have worked just as well. For that, though, you need to realize that sapply returns a matrix with one row per row of data in df and one column per range, which may look odd if you aren't familiar with it.)
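Spelled out in full, the rowSums variant might look like the sketch below. The name within_ranges_mat is hypothetical, and the sketch assumes x has more than one element, since sapply would otherwise return a plain vector that rowSums cannot handle.

```r
badData <- list(c(296, 310), c(330, 335), c(350, 565))

# Hypothetical matrix-based variant of within_ranges
within_ranges_mat <- function(x, lims) {
  # one column per range, one row per element of x (assumes length(x) > 1)
  m <- sapply(lims, function(lim) lim[1] <= x & x <= lim[2])
  # a row with at least one TRUE lies inside at least one range
  rowSums(m) > 0
}

wavelength <- seq(300, 360, 5.008667)
flagged <- within_ranges_mat(wavelength, badData)
which(flagged)
```

The Reduce version avoids the single-element caveat, which is one reason to prefer it.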

Now we can replace the bad values with NA, either with dplyr:

df %>%
  mutate(
    reflectance = if_else(within_ranges(wavelength, badData), NA_real_, reflectance)
  )
#    wavelength reflectance
# 1    300.0000          NA
# 2    305.0087          NA
# 3    310.0173   -11.01733
# 4    315.0260   -16.02600
# 5    320.0347   -21.03467
# 6    325.0433   -26.04333
# 7    330.0520          NA
# 8    335.0607   -36.06067
# 9    340.0693   -41.06934
# 10   345.0780   -46.07800
# 11   350.0867          NA
# 12   355.0953          NA

Edit: or another dplyr variant, using your first thought of replace (not my first choice, but only out of habit):

df %>%
  mutate(
    reflectance = replace(reflectance, within_ranges(wavelength, badData), NA_real_)
  )

Or base R:

df$reflectance <- ifelse(within_ranges(df$wavelength, badData), NA_real_, df$reflectance)
df
#    wavelength reflectance
# 1    300.0000          NA
# 2    305.0087          NA
# 3    310.0173   -11.01733
# 4    315.0260   -16.02600
# 5    320.0347   -21.03467
# 6    325.0433   -26.04333
# 7    330.0520          NA
# 8    335.0607   -36.06067
# 9    340.0693   -41.06934
# 10   345.0780   -46.07800
# 11   350.0867          NA
# 12   355.0953          NA
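Another base-R option, not part of the original answer, is to assign NA directly through a logical index. The sketch below repeats the definitions of badData, within_ranges, and df so that it runs on its own; the name bad is introduced only for illustration.

```r
badData <- list(c(296, 310), c(330, 335), c(350, 565))

within_ranges <- function(x, lims) {
  Reduce(`|`, lapply(lims, function(lim) lim[1] <= x & x <= lim[2]))
}

df <- data.frame(wavelength = seq(300, 360, 5.008667),
                 reflectance = seq(-1, -61, -5.008667))

# Logical mask of rows whose wavelength lies in any bad range
bad <- within_ranges(df$wavelength, badData)

# Overwrite only those rows; the column stays numeric
df$reflectance[bad] <- NA
df
```

Because only the flagged elements are touched, there is no need to supply the untouched values as a "false" argument.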


Notes:

  • I'm specifically using NA_real_, partly for clarity (did you know there are different types of NA?), and partly because dplyr::if_else will complain or fail if the classes of the "true" and "false" arguments are not the same (a plain NA is technically logical, not numeric as your reflectance is);
  • I use dplyr::if_else for the first example, since you're already using dplyr, but in case you choose to forgo dplyr (or somebody else does), the base-R ifelse works, too. (It has its liabilities, but it appears to work just fine here.)
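The different NA types mentioned in the notes can be inspected directly in base R. The vector x below is made-up illustration data.

```r
# Each typed NA constant has its own class
class(NA)             # "logical"
class(NA_real_)       # "numeric"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"

# Base R coerces a plain NA on assignment, so the vector stays numeric
x <- c(1.5, 2.5)
x[1] <- NA
class(x)              # "numeric"
```

Base-R assignment and ifelse coerce silently, whereas a stricter function needs the type stated up front, which is what NA_real_ provides.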
