Several substitutions in one line in R

Question

I have a column in a dataframe in R with values "-1","0","1". I'd like to replace these values with "no", "maybe" and "yes" respectively. I'll do this by using sub.

I could write a conditional function, and then code:

    df[col] <- lapply(df[col], conditional_function_substitution)

I could also do the substitutions one at a time (example of first of three):

    df[col] <- lapply(df[col], sub, pattern = '-1', replacement = "no")

I'm wondering if it can be done in one line? Something like:

    df[col] <- lapply(df[col], sub, pattern = c('-1','0','1'), replacement = c('no','maybe','yes'))

Thanks for your insight!

Recommended answer

By adding 2 to -1, 0, and 1, you could get indices into a vector of the desired outcomes:

c("no", "maybe", "yes")[dat + 2]
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"  
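
The arithmetic trick relies on every value being exactly -1, 0, or 1. As a minimal sketch of what can happen otherwise (the vectors `ok` and `odd` are hypothetical examples, not data from the question):

```r
labels <- c("no", "maybe", "yes")

# Values inside the expected range map cleanly.
ok <- c(-1, 1, 0)
labels[ok + 2]

# Hypothetical input with out-of-range values:
#    2 becomes index 4, which is past the end and yields NA;
#   -2 becomes index 0, which R drops from the result entirely.
odd <- c(2, -2, 1)
res <- labels[odd + 2]
res
length(res)  # shorter than the input, because the zero index was dropped
```

If stray values are possible, the lookup-based options further down are probably safer, since they return one result per input element.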

A related option could make use of the match function to figure out the indexing:

c("no", "maybe", "yes")[match(dat, -1:1)]
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"  
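
To tie this back to the data-frame setting in the question, here is one way the match-based lookup might be applied to several columns in a single assignment. The data frame `df` and the column names `q1` and `q2` are assumptions made up for illustration:

```r
# Hypothetical data frame; q1 and q2 stand in for the question's columns.
df <- data.frame(id = 1:4,
                 q1 = c(-1, 0, 1, 1),
                 q2 = c(1, 1, -1, 0))
cols <- c("q1", "q2")

# Recode every selected column at once; other columns are left untouched.
df[cols] <- lapply(df[cols], function(x) c("no", "maybe", "yes")[match(x, -1:1)])
df
```

The same pattern should work with any of the vectorized options in this answer in place of the match call.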

Alternately, you could use a named vector for recoding:

unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"   

You could also use nested ifelse:

ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"   

If you don't mind loading a new package, the Recode function from the car package does this:

library(car)
Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"  

Data:

dat <- c(-1, 1, 0, 1, 1, -1)

Note that all but the first would work unchanged if dat were stored as strings; for the first you would need to use as.numeric(dat) before adding 2.
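
As a small illustration of that note, assuming the same values stored as strings (the name `dat_chr` is introduced here only for the example):

```r
dat_chr <- c("-1", "1", "0", "1", "1", "-1")  # same data as above, stored as strings

# Index arithmetic needs an explicit numeric conversion first.
by_index <- c("no", "maybe", "yes")[as.numeric(dat_chr) + 2]

# match() compares on a common type, so -1:1 is treated as "-1", "0", "1"
# and no conversion is needed.
by_match <- c("no", "maybe", "yes")[match(dat_chr, -1:1)]

identical(by_index, by_match)
```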

If code clarity is your main objective, then you should pick the one that you find easiest to understand -- I would personally pick the second or last but that is personal preference.

If code speed is of interest, you can benchmark the solutions. Here are benchmarks of the five options I've presented, plus the two solutions currently posted as other answers, run on a random vector of length 100k:

set.seed(144)
dat <- sample(c(-1, 0, 1), replace=TRUE, 100000)
opt1 <- function(dat) c("no", "maybe", "yes")[dat + 2]
opt2 <- function(dat) c("no", "maybe", "yes")[match(dat, -1:1)]
opt3 <- function(dat) unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
opt4 <- function(dat) ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
opt5 <- function(dat) Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
AnandaMahto <- function(dat) factor(dat, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
hrbrmstr <- function(dat) sapply(as.character(dat), switch, `-1`="no", `0`="maybe", `1`="yes", USE.NAMES=FALSE)
library(car)  # provides Recode(), used by opt5
library(microbenchmark)
microbenchmark(opt1(dat), opt2(dat), opt3(dat), opt4(dat), opt5(dat), AnandaMahto(dat), hrbrmstr(dat))
# Unit: milliseconds
#              expr        min         lq       mean     median         uq        max neval
#         opt1(dat)   1.513500   2.553022   2.763685   2.656010   2.837673   4.384149   100
#         opt2(dat)   2.153438   3.013502   3.251850   3.117058   3.269230   5.851234   100
#         opt3(dat)  59.716271  61.890470  64.978685  62.509046  63.723048 144.708757   100
#         opt4(dat)  62.934734  64.715815  71.181477  65.652195  71.123384 123.840577   100
#         opt5(dat)  82.976441  84.849147  89.071808  85.752429  88.473162 155.347273   100
#  AnandaMahto(dat)  57.267227  58.643889  60.508402  59.065642  60.368913  80.852157   100
#     hrbrmstr(dat) 137.883307 148.626496 158.051220 153.441243 162.594752 228.271336   100

The first two options appear to be more than an order of magnitude quicker than any of the other options, though either the vector would have to be pretty huge or you would need to be repeating the operation a number of times for any of this to make a difference.

As pointed out by @AnandaMahto, these results are qualitatively different if we have character input instead of numeric input:

set.seed(144)
dat <- sample(c("-1", "0", "1"), replace=TRUE, 100000)
opt1 <- function(dat) c("no", "maybe", "yes")[as.numeric(dat) + 2]
opt2 <- function(dat) c("no", "maybe", "yes")[match(dat, -1:1)]
opt3 <- function(dat) unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
opt4 <- function(dat) ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
opt5 <- function(dat) Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
AnandaMahto <- function(dat) factor(dat, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
hrbrmstr <- function(dat) sapply(dat, switch, `-1`="no", `0`="maybe", `1`="yes", USE.NAMES=FALSE)
library(car)  # provides Recode(), used by opt5
library(microbenchmark)
microbenchmark(opt1(dat), opt2(dat), opt3(dat), opt4(dat), opt5(dat), AnandaMahto(dat), hrbrmstr(dat))
# Unit: milliseconds
#              expr       min        lq       mean     median         uq        max neval
#         opt1(dat)  8.397194  9.519075  10.784108   9.693706  10.163203   55.78417   100
#         opt2(dat)  2.281438  3.091418   4.231162   3.210794   3.436038   49.39879   100
#         opt3(dat)  3.606863  5.481115   6.466393   5.720282   6.344651   48.47924   100
#         opt4(dat) 66.819638 69.996704  74.596960  71.290522  73.404043  127.52415   100
#         opt5(dat) 32.897019 35.701401  38.488489  36.336489  38.950272   88.20915   100
#  AnandaMahto(dat)  1.329443  2.114504   2.824306   2.275736   2.493907   46.19333   100
#     hrbrmstr(dat) 81.898572 91.043729 154.331766 100.006203 141.425717 1594.17447   100

Now, the factor solution proposed by @AnandaMahto is the quickest, followed by vector indexing with match and named vector lookup. Again, all runtimes are fast enough that you would need a large vector or many runs for any of this to matter.
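
One thing worth keeping in mind when comparing these timings is that the factor approach returns a factor rather than a character vector. A brief sketch, using a short hypothetical input, of how to get plain strings back if that is what you need:

```r
dat <- c("-1", "1", "0", "1", "1", "-1")

# The factor-based recoding, as benchmarked above.
f <- factor(dat, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
class(f)

# Convert to character when downstream code expects plain strings.
chr <- as.character(f)
chr
```

The extra as.character step is not included in the benchmark above, so it would add to the runtime reported there.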
