Summary stats across columns, where column names indicate groups
Question
Data frame have includes a few thousand vectors that follow a naming pattern. Each vector name consists of a noun followed by _a, _b, or _c. Below are the first 10 vars and obs:
id turtle_a banana_a castle_a turtle_b banana_b castle_b turtle_c banana_c castle_c
A -0.58 -0.88 -0.56 -0.53 -0.32 -0.42 -0.52 -0.89 -0.72
B NA NA NA -0.84 -0.36 -0.26 NA NA NA
C 0.00 -0.43 -0.75 -0.35 -0.88 -0.14 -0.26 -0.15 -0.81
D -0.81 -0.63 -0.77 -0.82 -0.83 -0.50 -0.77 -0.25 -0.07
E -0.25 -0.33 -0.09 -0.51 -0.27 -0.81 -0.06 -0.23 -0.97
F -0.80 -0.88 -0.05 NA NA NA NA NA NA
G -0.25 -0.76 -0.21 NA NA NA NA NA NA
H -0.47 -0.10 -0.67 -0.46 -0.71 -0.24 -0.76 -0.04 -0.11
I -0.15 -0.34 -0.57 -0.40 -0.14 -0.49 NA NA NA
J -0.65 -0.86 -0.37 -0.67 -0.81 -0.63 NA NA NA
Data frame want is the mean across all columns for every set of variables in a noun group. For example, averaging turtle_a, turtle_b, and turtle_c for id = A gives -0.54. Here's what want looks like if I just do it for the handful of noun groups in the example.
id turtle_m banana_m castle_m
A -0.54 -0.70 -0.57
B -0.84 -0.36 -0.26
C -0.20 -0.49 -0.57
D -0.80 -0.57 -0.45
E -0.27 -0.28 -0.62
F -0.80 -0.88 -0.05
G -0.25 -0.76 -0.21
H -0.56 -0.29 -0.34
I -0.27 -0.24 -0.53
J -0.66 -0.83 -0.50
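The table implies that missing values are skipped rather than propagated: row B only has values in the _b columns, and its means equal those values. A quick check of two turtle_m entries, using the numbers from have above (the variable names a and b are just for illustration):

```r
# Row A: all three turtle columns are present.
a <- round(mean(c(-0.58, -0.53, -0.52)), 2)

# Row B: turtle_a and turtle_c are NA, so only turtle_b contributes.
b <- round(mean(c(NA, -0.84, NA), na.rm = TRUE), 2)

print(c(A = a, B = b))
```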
Options so far:
- convert to long, summarize with a group_by() function in dplyr, and transpose back to wide.
- re-sort the vectors so the noun groups appear next to each other, and write a loop that computes means across columns, taking three-column steps at each iteration
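For reference, a variant of the second option can be sketched in base R without depending on column order: loop over the unique noun prefixes and average whichever columns match each one. This is only a sketch under assumptions. The three-row have below is a subset of the example data, and nouns, cols, and want are illustrative names.

```r
# Small subset of the example data (rows A, B, C; two noun groups).
have <- data.frame(
  id       = c("A", "B", "C"),
  turtle_a = c(-0.58, NA, 0.00),
  banana_a = c(-0.88, NA, -0.43),
  turtle_b = c(-0.53, -0.84, -0.35),
  banana_b = c(-0.32, -0.36, -0.88),
  turtle_c = c(-0.52, NA, -0.26),
  banana_c = c(-0.89, NA, -0.15)
)

# Noun prefixes, in order of first appearance.
nouns <- unique(sub("_[abc]$", "", names(have)[-1]))

want <- have["id"]
for (noun in nouns) {
  # Columns named exactly <noun>_a, <noun>_b, or <noun>_c.
  cols <- grep(paste0("^", noun, "_[abc]$"), names(have), value = TRUE)
  # Row-wise mean over the matching columns, skipping NAs.
  want[[paste0(noun, "_m")]] <- rowMeans(have[cols], na.rm = TRUE)
}

print(want)
```

Matching by name rather than by position means the columns do not have to be re-sorted first, and a group with fewer than three columns would still be handled.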
It seems like summarize_at or summarize_all could be used more effectively than either of my current options, but I'm not sure how to use them in a way that will dynamically group variables by naming convention.
Any thoughts?
Recommended answer
We can use split.default to split the columns based on a substring of the column names, loop over the resulting list with sapply, apply rowMeans to each element, and then cbind the result with the first column
# Split columns by noun prefix, take row means per group, then reattach id
out <- cbind(df1[1],
             sapply(split.default(df1[-1], sub("_.*", "", names(df1)[-1])),
                    rowMeans, na.rm = TRUE))
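To make the intermediate steps visible, the one-liner can be unpacked as below. This is a sketch on a three-row subset of the example data. The names grp, pieces, and means are illustrative, and the _m suffix is added only to match the column names in want.

```r
# Small subset of the example data (rows A, B, C; two noun groups).
df1 <- data.frame(
  id       = c("A", "B", "C"),
  turtle_a = c(-0.58, NA, 0.00),
  banana_a = c(-0.88, NA, -0.43),
  turtle_b = c(-0.53, -0.84, -0.35),
  banana_b = c(-0.32, -0.36, -0.88),
  turtle_c = c(-0.52, NA, -0.26),
  banana_c = c(-0.89, NA, -0.15)
)

# Grouping key: drop everything from the first underscore onward.
grp <- sub("_.*", "", names(df1)[-1])
print(grp)

# split.default treats the data frame as a list of columns,
# returning one smaller data frame per noun.
pieces <- split.default(df1[-1], grp)

# Row-wise mean within each piece, skipping NAs.
means <- sapply(pieces, rowMeans, na.rm = TRUE)

# Add the _m suffix, then reattach the id column.
colnames(means) <- paste0(colnames(means), "_m")
out <- cbind(df1[1], means)
print(out)
```

Two things to keep in mind. The groups come back in alphabetical order (banana before turtle) because the key is treated as a factor. And sub("_.*", "", ...) cuts at the first underscore, so if a noun itself contained an underscore, a pattern anchored at the end such as "_[abc]$" would be the safer key.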
Or we can use pivot_longer
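library(dplyr)
library(tidyr)
df1 %>%
  # ".value" keeps one column per noun; the a/b/c suffix goes to "group"
  pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
  group_by(id) %>%
  # for the full data, across(-group, ...) selects every noun column
  summarise(across(turtle:castle, ~ mean(.x, na.rm = TRUE)))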