Summary stats across columns, where column names indicate groups


Problem description

Data frame have includes a few thousand vectors that follow a naming pattern. Each vector name consists of a noun followed by _a, _b, or _c. Below are the first 10 vars and obs:

id  turtle_a   banana_a   castle_a   turtle_b   banana_b   castle_b   turtle_c   banana_c   castle_c
A      -0.58      -0.88      -0.56      -0.53      -0.32      -0.42      -0.52      -0.89      -0.72
B         NA         NA         NA      -0.84      -0.36      -0.26         NA         NA         NA
C       0.00      -0.43      -0.75      -0.35      -0.88      -0.14      -0.26      -0.15      -0.81
D      -0.81      -0.63      -0.77      -0.82      -0.83      -0.50      -0.77      -0.25      -0.07
E      -0.25      -0.33      -0.09      -0.51      -0.27      -0.81      -0.06      -0.23      -0.97
F      -0.80      -0.88      -0.05         NA         NA         NA         NA         NA         NA
G      -0.25      -0.76      -0.21         NA         NA         NA         NA         NA         NA
H      -0.47      -0.10      -0.67      -0.46      -0.71      -0.24      -0.76      -0.04      -0.11
I      -0.15      -0.34      -0.57      -0.40      -0.14      -0.49         NA         NA         NA
J      -0.65      -0.86      -0.37      -0.67      -0.81      -0.63         NA         NA         NA

Data frame want holds, for each id, the mean across all columns in a noun group. For example, averaging turtle_a, turtle_b, and turtle_c for id=A gives -0.54. Here's what want looks like if I just do it for the handful of noun groups in the example.

id   turtle_m    banana_m    castle_m
A       -0.54       -0.70       -0.57
B       -0.84       -0.36       -0.26
C       -0.20       -0.49       -0.57
D       -0.80       -0.57       -0.45
E       -0.27       -0.28       -0.62
F       -0.80       -0.88       -0.05
G       -0.25       -0.76       -0.21
H       -0.56       -0.29       -0.34
I       -0.27       -0.24       -0.53
J       -0.66       -0.83       -0.50
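
As a quick check of the arithmetic behind want, the sketch below recomputes two turtle_m values by hand from the rows shown above. The vector names turtle_A and turtle_B are made up for this illustration; the second case assumes missing values are meant to be skipped rather than propagated, which is what the B row of want implies.

```r
# turtle_a, turtle_b, turtle_c for id = A, copied from the first table
turtle_A <- c(-0.58, -0.53, -0.52)
mean_A <- round(mean(turtle_A), 2)
print(mean_A)

# id = B only has turtle_b; na.rm = TRUE drops the missing values
turtle_B <- c(NA, -0.84, NA)
mean_B <- round(mean(turtle_B, na.rm = TRUE), 2)
print(mean_B)
```

These come out to -0.54 and -0.84, matching the turtle_m column for ids A and B.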

Options so far:

  1. Convert to long, summarize with a group_by() function in dplyr, and transpose back to wide.
  2. Re-sort the vectors so the noun groups appear next to each other, and write a loop that computes means across columns, taking three-column steps at each iteration.

It seems like summarize_at or summarize_all could be used more effectively than either of my current options, but I'm not sure how to use them in a way that will dynamically group variables by naming convention.

Any ideas?

Recommended answer

We can use split.default to split the columns based on a substring of the column names, loop over the resulting list with sapply and rowMeans, and then cbind the result with the first column.

out <- cbind(df1[1], sapply(split.default(df1[-1], 
    sub("_.*", "", names(df1)[-1])), rowMeans, na.rm = TRUE))
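
To see what each step of that one-liner does, here is a minimal sketch on a hypothetical three-row subset of the question's data (ids A, B, and F, two nouns only). The intermediate names nouns, groups, and means are introduced just for this illustration, and the _m suffix and rounding are optional extras to match the layout of want.

```r
# Hypothetical subset of the question's data: ids A, B, F and two nouns
df1 <- data.frame(
  id       = c("A", "B", "F"),
  turtle_a = c(-0.58, NA, -0.80),
  banana_a = c(-0.88, NA, -0.88),
  turtle_b = c(-0.53, -0.84, NA),
  banana_b = c(-0.32, -0.36, NA),
  turtle_c = c(-0.52, NA, NA),
  banana_c = c(-0.89, NA, NA)
)

# Drop everything from the first underscore onward to recover the noun
nouns <- sub("_.*", "", names(df1)[-1])

# One sub-data-frame per noun; split orders the groups alphabetically
groups <- split.default(df1[-1], nouns)

# Row means within each noun group, skipping NA values
means <- sapply(groups, rowMeans, na.rm = TRUE)

# Optional: add the _m suffix used in the desired output
colnames(means) <- paste0(colnames(means), "_m")

out <- cbind(df1[1], round(means, 2))
print(out)
```

Note that the result columns come back in alphabetical order (banana_m before turtle_m) rather than in the order of the input columns, and that a row whose values are all NA for a noun would yield NaN rather than NA.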


Or we can use pivot_longer:

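library(dplyr)
library(tidyr)
df1 %>% 
   # ".value" makes each noun its own column; the suffix (a, b, c) goes to "group"
   pivot_longer(cols = -id, names_sep="_", names_to = c(".value", "group")) %>%
   group_by(id) %>%
   # average every noun column within each id, ignoring NA
   summarise(across(-group, ~ mean(.x, na.rm = TRUE)))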
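
The reshaping step is easier to follow when the intermediate long table is inspected. The sketch below uses a hypothetical two-row, two-noun subset of the question's data and stores the intermediate result in a variable named long, which is not part of the original answer. It selects the noun columns with -group and a lambda instead of a named range, on the assumption that the real data has too many nouns to list.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the question's data: ids A and B, suffixes a and b
df1 <- data.frame(
  id       = c("A", "B"),
  turtle_a = c(-0.58, NA),
  banana_a = c(-0.88, NA),
  turtle_b = c(-0.53, -0.84),
  banana_b = c(-0.32, -0.36)
)

# ".value" keeps the part before "_" as the output column name,
# so there is one row per id and suffix, and one column per noun
long <- df1 %>%
  pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group"))
print(long)

# Collapse the suffix rows back to one row per id
out <- long %>%
  group_by(id) %>%
  summarise(across(-group, ~ mean(.x, na.rm = TRUE)))
print(out)
```

Here long has four rows (two ids times two suffixes) with columns id, group, turtle, and banana, and out has one row per id. Because names_sep = "_" splits on every underscore, this sketch assumes the nouns themselves contain no underscores.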
