R:合并列和具有相同列名的值 [英] R: merging columns and the values if they have the same column name
问题描述
我有一个超过 100 列的电子表格,并且许多列具有相同的名称.我想合并具有相同名称的那些列,并对这些列中的值进行行求和.我认为条件执行 if() 应该这样做,但我坚持为相同的列名编写条件?合并和求和列的函数是什么?合并()?还是 rowsum()?
I have a spreadsheet that have >100 columns, and many columns have the same names. I would like to merge those columns with the same names and row-sum the values in those columns. I think conditional execution, if(), should do it, but I'm stuck at writing the condition for identical column names? And what will be the function to merge and sum the columns? merge()? or rowsum()?
aa <- read.table()
aa <- read.table()
if (colnames(aa) == ) 合并/rowsum()
if (colnames(aa) == ) merge/rowsum()
谢谢.
这是现在的样例:
B C U B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
这就是我希望得到的:减少列数并对合并时的值求和.
And this is what I hope to get: Reduction of the number of columns and sum the values when merging.
B C U
2 2 1
4 4 2
6 6 3
8 8 4
推荐答案
解决方案 1
使用 split()
、lapply()
、rowSums()
和 do.call()
/cbind()
:
do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
<小时>
解决方案 2
用 Reduce()
/`+`()
替换 rowSums()
调用:
do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
<小时>
解决方案 3
直接拆分data.frame(作为未分类列表)替换索引向量中间人:
Solution 3
Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:
do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
<小时>
基准测试
library(microbenchmark);
bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));
<小时>
df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);
ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE
microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742 641.052 100
## bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618 100
## bgoldst3(df) 82.110 92.5875 100.9138 97.2915 107.128 170.207 100
## sotos(df) 200.997 211.9030 226.7977 223.6630 235.210 328.010 100
<小时>
set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));
ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE
microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929 100
## bgoldst2(df) 4.534402 4.680446 6.161428 6.097900 6.425697 44.83254 100
## bgoldst3(df) 3.430203 3.555505 5.355128 4.919931 5.219930 41.79279 100
## sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525 100
这篇关于R:合并列和具有相同列名的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!