Cumulatively paste (concatenate) values grouped by another variable
Question
I have a problem dealing with a data frame in R. I would like to paste the contents of cells in different rows together, based on the values of the cells in another column. My problem is that I want the output to be built up progressively (cumulatively), so the output vector must have the same length as the input vector. Here is a sample table similar to the one I am dealing with:
id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
(testdf <- data.frame(id, content, stringsAsFactors=FALSE))
#  id content
#1  a       A
#2  a       B
#3  a       A
#4  b       B
#5  b       C
#6  b       B
This is the result I want:
result <- c("A", "A B", "A B A", "B", "B C", "B C B")
result
#[1] "A" "A B" "A B A" "B" "B C" "B C B"
I do NOT need something like this:
library(plyr)  # ddply() and .() come from the plyr package
ddply(testdf, .(id), summarize, content_concatenated = paste(content, collapse = " "))
#  id content_concatenated
#1  a                A B A
#2  b                B C B
Recommended answer
You could define a "cumulative paste" function using Reduce:
cumpaste = function(x, .sep = " ")
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
cumpaste(letters[1:3], "; ")
#[1] "a" "a; b" "a; b; c"
Reduce's loop avoids re-concatenating elements from the start, because it extends the previous concatenation by the next element.
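To make the accumulation concrete, here is a minimal sketch that writes the same idea as an explicit loop. The name cumpaste_loop is hypothetical and not part of the original answer; it only illustrates what Reduce(..., accumulate = TRUE) does step by step.

```r
# Hypothetical illustration: an explicit loop equivalent to
# Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
cumpaste_loop <- function(x, .sep = " ") {
  out <- character(length(x))
  for (i in seq_along(x)) {
    # Extend the previous result by one element instead of starting over
    out[i] <- if (i == 1L) x[i] else paste(out[i - 1L], x[i], sep = .sep)
  }
  out
}

cumpaste_loop(c("A", "B", "A"))
#[1] "A"     "A B"   "A B A"
```

Each step reuses out[i - 1L], so every element is pasted onto the running result exactly once.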
Applying it by group:
ave(as.character(testdf$content), testdf$id, FUN = cumpaste)
#[1] "A" "A B" "A B A" "B" "B C" "B C B"
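Since ave() returns its result in the original row order, the output can be stored straight back into the data frame, even when the groups are not contiguous. The sketch below uses a small made-up data frame (df and the column name cum_content are assumptions for illustration) and repeats cumpaste so that it runs on its own.

```r
# Same cumpaste as above, repeated so this block is self-contained
cumpaste <- function(x, .sep = " ")
  Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)

# Hypothetical data: the ids are interleaved rather than sorted
df <- data.frame(id = c("a", "b", "a", "b"),
                 content = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# ave() splits content by id, applies cumpaste within each group,
# and puts the results back in the original row positions
df$cum_content <- ave(df$content, df$id, FUN = cumpaste)
df$cum_content
#[1] "A"   "B"   "A C" "B D"
```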
Another idea is to concatenate the whole vector at the start and then progressively substring it:
cumpaste2 = function(x, .sep = " ")
{
concat = paste(x, collapse = .sep)
substring(concat, 1L, cumsum(c(nchar(x[[1L]]), nchar(x[-1L]) + nchar(.sep))))
}
cumpaste2(letters[1:3], " ;@-")
#[1] "a" "a ;@-b" "a ;@-b ;@-c"
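To see where the substring end positions come from, the intermediate values of cumpaste2 can be inspected one at a time. This is only a walk-through of the function above with the same inputs; the variable name ends is introduced here for illustration.

```r
x    <- letters[1:3]
.sep <- " ;@-"

# The whole vector joined once
concat <- paste(x, collapse = .sep)
concat
#[1] "a ;@-b ;@-c"

# End position of each cumulative piece: the first element contributes
# only its own width; every later element also brings one separator
ends <- cumsum(c(nchar(x[[1L]]), nchar(x[-1L]) + nchar(.sep)))
ends
#[1]  1  6 11

# substring() is vectorised over the end positions, giving one prefix each
substring(concat, 1L, ends)
#[1] "a"           "a ;@-b"      "a ;@-b ;@-c"
```

Note that, as written, cumpaste2 assumes x has at least one element, because x[[1L]] fails on a zero-length vector.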
cumpaste2 also seems to be somewhat faster:
set.seed(077)
X = replicate(1e3, paste(sample(letters, sample(0:5, 1), TRUE), collapse = ""))
identical(cumpaste(X, " --- "), cumpaste2(X, " --- "))
#[1] TRUE
microbenchmark::microbenchmark(cumpaste(X, " --- "), cumpaste2(X, " --- "), times = 30)
#Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#  cumpaste(X, " --- ") 21.19967 21.82295 26.47899 24.83196 30.34068 39.86275    30   b
# cumpaste2(X, " --- ") 14.41291 14.92378 16.87865 16.03339 18.56703 23.22958    30  a
...which makes it a cumpaste_faster.