Recode dataframe based on one column

Problem description

I have a 5845*1095 (rows*columns) data frame that looks like this:

 9  286593   C     C/C     C/A     A/A
 9  334337   A     A/A     G/A     A/A
 9  390512   C     C/C     C/C     C/C

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A") 
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

I want to use the value in the third column to recode the columns to its right. For example, in row 1 column 3 is "C", so column 4 changes from "C/C" to "0" because both letters match. A one-letter match (first or second letter) becomes "1", and no match becomes "2".

9 286593  C  0  1  2
9 334337  A  0  1  0
9 390512  C  0  0  0 

c <-  c("9", "286593", "C", "0", "1", "2") 
d <-  c("9", "334337", "A", "0", "1", "0")
e <-   c("9", "390512", "C", "0", "0", "0")
dat <- data.frame(rbind(c,d,e))

I am interested in the best way to do this, as I want to get out of the habit of using nested for loops in R.

Recommended answer

First, your data:

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A")
# Note: In your original data there was a space in "G/A", which I removed.
# If that was not a mistake, we would also have to deal with the space.
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

Now we create a vector that contains all possible letters.

values <- c("A", "C", "G", "T")
# Giving the reference column these factor levels ensures that it can later
# be compared with the generated combinations.
dat$X3 <- factor(dat$X3, levels=values)

# Generate all possible combinations of two letters
combinations <- expand.grid(f=values, s=values)
combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))

The main function looks up each value of a column in the combinations table and then compares both letters of the matched combination with the reference column 3. The result is 2 minus the number of letters that match.

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}
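To make the lookup concrete, here is a minimal, self-contained sketch that calls `compare` on a single toy column. The vectors `genotypes` and `reference` are hypothetical inputs made up for illustration, and the table is built with `combinations$v <- ...` rather than `cbind`, which should be equivalent here.

```r
values <- c("A", "C", "G", "T")

# Lookup table of all two-letter combinations: f = first letter, s = second letter
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
    # Row of the lookup table that corresponds to each cell
    m <- match(col, combinations$v)
    # Start from 2 and subtract one for every letter equal to the reference
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

# Hypothetical toy input: one genotype column and its reference letters
genotypes <- c("C/C", "G/A", "A/A")
reference <- factor(c("C", "A", "C"), levels = values)

scores <- compare(genotypes, reference)
print(scores)
```

The reference has to be a factor with the same levels as `combinations$f` and `combinations$s`, because R refuses to compare factors with `==` when their level sets differ.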

Finally, we use apply to run the function on all columns that have to be changed. You will probably want to change the 6 to your actual number of columns (1095 in the data described above).

dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])
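As a rough end-to-end sketch, the steps can be put together so that the column range follows the width of the data frame instead of a hard-coded 6. The names `r1`, `r2`, `r3` and `geno_cols` are assumptions introduced here, and `stringsAsFactors = FALSE` is set explicitly so the sketch behaves the same across R versions.

```r
r1 <- c("9", "286593", "C", "C/C", "C/A", "A/A")
r2 <- c("9", "334337", "A", "A/A", "G/A", "A/A")
r3 <- c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(r1, r2, r3), stringsAsFactors = FALSE)

values <- c("A", "C", "G", "T")
dat$X3 <- factor(dat$X3, levels = values)

# Lookup table of all two-letter combinations
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

# Everything to the right of the reference column, whatever the width
geno_cols <- 4:ncol(dat)
dat[, geno_cols] <- apply(dat[, geno_cols], 2, compare, val = dat[, 3])
print(dat)
```

On the three sample rows this should reproduce the desired output from the question (0 1 2, 0 1 0 and 0 0 0).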

Note that, unlike the other solutions so far, this one does not use string comparison but relies purely on factor levels. It would be interesting to see which one performs better.
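For contrast, a string-comparison variant could look roughly like the sketch below. It is not the code of any of the other answers; `score_strings` is a hypothetical helper written only to illustrate the idea of splitting each cell and counting mismatching letters.

```r
r1 <- c("9", "286593", "C", "C/C", "C/A", "A/A")
r2 <- c("9", "334337", "A", "A/A", "G/A", "A/A")
r3 <- c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(r1, r2, r3), stringsAsFactors = FALSE)

# Hypothetical helper: split each cell at "/" and count the letters
# that differ from the reference letter of the same row
score_strings <- function(col, ref) {
    parts <- strsplit(as.character(col), "/", fixed = TRUE)
    mapply(function(p, r) sum(p != r), parts, as.character(ref),
           USE.NAMES = FALSE)
}

dat[4:6] <- lapply(dat[4:6], score_strings, ref = dat[[3]])
print(dat)
```

This version needs no lookup table or factor levels, at the cost of one `strsplit` call per column.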

I just did some benchmarking:

    test replications elapsed relative user.self sys.self user.child sys.child
1   arun      1000000   2.881    1.116     2.864    0.024          0         0
2  fabio      1000000   2.593    1.005     2.558    0.030          0         0
3 roland      1000000   2.727    1.057     2.687    0.048          0         0
5  thilo      1000000   2.581    1.000     2.540    0.036          0         0
4  tyler      1000000   2.663    1.032     2.626    0.042          0         0

This leaves my version slightly faster. However, the difference is close to nothing, so you are probably fine with any of the approaches. To be fair, I did not benchmark the part where I add the additional factor levels; including it would probably rule my version out.
