r tidyverse-计算具有相同名称的多个列的均值 [英] r tidyverse - calculate mean across multiple columns with same name

查看:157
本文介绍了r tidyverse-计算具有相同名称的多个列的均值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些每周收集的数据,其摘要如下,通过 dput

I have some data that is collected weekly, a snippet of which is like so, via dput:

p <- structure(list(railroad = structure(c(2L, 2L, 2L, 3L, 3L, 3L), .Label = 
c("All Other Railroads", 
"BNSF Railway Company", "CN", "CSX Transportation", "Norfolk Southern", 
"The Kansas City Southern Railway and Kansas City Southern de Mexico, S.A. de 
C.V. Consolidated ", 
"Union Pacific Railroad"), class = "factor"), measure = structure(c(1L, 
4L, 3L, 1L, 4L, 3L), .Label = c("Cars On Line - By Car Owner", 
"Cars On Line - By Car Type", "Terminal Dwell (Hours)", "Train Speed (MPH)"
), class = "factor"), category = structure(c(76L, 35L, 4L, 76L, 
35L, 29L), .Label = c("All Trains", "Allentown, PA", "Baltimore, MD", 
"Barstow, CA", "Bellevue, OH", "Birmingham, AL", "Box", "Buffalo, NY", 
"Chattanooga, TN", "Chicago (Proviso), IL", "Chicago, IL", "Cincinnati, OH", 
"Coal Unit", "Columbus, OH", "Conway, PA", "Corbin, KY", "Covered Hopper", 
"Decatur, IL", "Denver, CO", "Elkhart, IN", "Entire Railroad", 
"Fond du Lac Yard, WI", "Foreign RR", "Fort Worth, TX", "Galesburg, IL", 
"Gondola", "Grain Unit", "Hamlet, NC", "Harrison Yard (Memphis), TN", 
"Hinkle, OR", "Houston (Englewood), TX", "Houston (Settegast), TX", 
"Houston, TX", "Indianapolis, IN", "Intermodal", "Jackson Yard, MS", 
"Jackson, MS", "Kansas City, KS", "Kansas City, MO", "Knoxville, TN", 
"Laredo, TX", "Lincoln, NE", "Linwood, NC", "Livonia, LA", "Louisville, KY", 
"MacMillan Yard (Toronto), ON", "Macon, GA", "Manifest", "Markham Yard, IL", 
"Memphis, TN", "Monterrey, NL", "Montgomery, AL", "Multilevel", 
"Nashville, TN", "New Orleans, LA", "North Little Rock, AR", 
"North Platte East, NE", "North Platte West, NE", "Northtown, MN", 
"Nuevo Laredo, TM", "Open Hopper", "Other", "Pasco, WA", "Pct. Private", 
"Pine Bluff, AR", "Private", "Roanoke, VA", "Roseville, CA", 
"Russell, KY", "San Luis Potosi, SL", "Sanchez, TM", "Selkirk, NY", 
"Sheffield, AL", "Shreveport, LA", "Symington Yard (Winnipeg), MB", 
"System", "Tank", "Tascherau Yard (Montreal), QC", "Thornton Yard (Vancouver), 
BC", 
"Toledo, OH", "Total", "Tulsa, OK", "Walker Yard (Edmonton), AB", 
"Waycross, GA", "West Colton, CA", "Willard, OH"), class = "factor"), 
`201510` = c(66923, 33.9, 39.3, 40227, 30.8, 17.5), `201510` = c(66637, 
32.6, 56.6, 40778, 30.9, 18.3), `201510` = c(66309, 33.4, 
44.9, 40407, 30.5, 17.3), `201511` = c(65980, 34.6, 37.5, 
40316, 30.6, 17.5), `201511` = c(67034, 34.6, 43.1, 40174, 
30.4, 18.7)), row.names = c(1L, 15L, 21L, 33L, 47L, 53L), class = 
"data.frame")

共有143列,第4-143列是数字,我想计算所有具有相同c的列的均值列名。因此,下面的列 201510 重复了3次,列 201511 重复了两次。期望的输出是重复的每列的平均值。例如, 201510 将具有以下值:

There are 143 columns total, and columns 4 - 143 are numeric. I would like to calculate the mean for all columns that have the same column name. So below there is column 201510 repeated 3 times and column 201511 repeated twice. The desired output is the mean of each column repeated. For example, 201510 will have the following values:

`201510`
[1] 66623.00000    33.30000    46.93333 40470.66667    30.73333    17.70000

我尝试了以下代码:

library(tidyverse)

p = data.frame(p)

p %>%
  gather(time,value,railroad, measure, category) %>%                       
  mutate(time = gsub('X([^.]+)|.', '\\1', time)) %>%  
  group_by(time, value, railroad, measure, category) %>%                            
  summarise(MEAN = mean(value)) %>%                   
  ungroup() %>%                                       
  spread(time, MEAN)  

产生以下错误:

`Error in grouped_df_impl(data, unname(vars), drop) : 
Column `railroad` is unknown
In addition: Warning message:
attributes are not identical across measure variables;
they will be dropped `

有办法吗?

推荐答案

这里的主要问题是列名不唯一。 tidyverse大多假定唯一的列名,并且许多函数添加后缀以使它们唯一(如果还没有的话),许多基本函数也是如此,因此在以下所有解决方案中,我们都避免使用任何此类函数。我们仍然可以使用magrittr,purrr,某些基本函数仍然允许这样做。

The main problem here is the non-unique column names. The tidyverse mostly assumes unique column names and many functions add suffixes to make them unique if they were not already, as do many base functions, so in all the solutions below we simply avoid using any such function. We can still use magrittr, purrr and certain base functions still allow this too.

(1),(2)和(4)仅使用magrittr。 (1a)使用purrr,在(3)中,我们使用tidyr和dplyr,但仅在转换为长格式之后。

(1), (2) and (4) only use magrittr. (1a) uses purrr and in (3) we use tidyr and dplyr but only after converting to long form.

所有解决方案均会附加名称为平均值。* 表示数字列中的每个唯一名称。在问题的示例中,数字列之间有两个唯一的名称,因此在该示例中,它追加了两列,分别命名为 mean.201510 mean.201511 如下所示。我们仅在(1)中显示输出,但其余输出相似。

All solutions append a column whose name is of the form mean.* for each unique name among the numeric columns. In the example in the question there are two unique names among the numeric columns so for that example it appends two columns and they are named mean.201510 and mean.201511 as shown below. We only show the output in (1) but the output for the rest is similar.

所有解决方案都使用两个管道。第一个由第一个%>%组成,第二个pipleline作为 cbind 的自变量出现新列。

All solutions use two pipelines. The first consists of the first %>% and the second pipleline appears as an argument to cbind and is what creates the new columns.

(1),(1a)和(4)最短。

(1), (1a) and (4) are tied for the shortest.

1)magrittr magrittr 本身似乎没有添加后缀。 cbind 与以下原始数据帧 p 。首先将 p 转换为列列表,提取数字成分,将其拆分为列名,将每个成分转换为数据框,并获取每个的rowMeans,最后进行设置*。

1) magrittr magrittr itself does not seem to add suffixes. cbind the original data frame p with the following. First convert p to a list of columns, extract the numeric components, split that on the column names, convert each component to a data frame and take the rowMeans of each and finally set the names to mean.* .

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    split(names(.)) %>%
    lapply(as.data.frame) %>%
    lapply(rowMeans) %>%
    setNames(paste0("mean.", names(.)))
  )

给予:

               railroad                     measure                    category
1  BNSF Railway Company Cars On Line - By Car Owner                      System
15 BNSF Railway Company           Train Speed (MPH)                  Intermodal
21 BNSF Railway Company      Terminal Dwell (Hours)                 Barstow, CA
33                   CN Cars On Line - By Car Owner                      System
47                   CN           Train Speed (MPH)                  Intermodal
53                   CN      Terminal Dwell (Hours) Harrison Yard (Memphis), TN
    201510  201510  201510  201511  201511 mean.201510 mean.201511
1  66923.0 66637.0 66309.0 65980.0 67034.0 66623.00000     66507.0
15    33.9    32.6    33.4    34.6    34.6    33.30000        34.6
21    39.3    56.6    44.9    37.5    43.1    46.93333        40.3
33 40227.0 40778.0 40407.0 40316.0 40174.0 40470.66667     40245.0
47    30.8    30.9    30.5    30.6    30.4    30.73333        30.5
53    17.5    18.3    17.3    17.5    18.7    17.70000        18.1

1a)purrr (可选),我们可以将某些基本函数替换为它们的purrr或magrittr。在其他解决方案中,我们也可以转换为purrr。

1a) purrr Optionally we could replace some of the base functions with their purrr or magrittr equivalent. We could also translate to purrr in the other solutions.

library(magrittr)
library(purrr)

p %>%
  cbind(as.list(.) %>%
    keep(is.numeric) %>%
    split(names(.)) %>%
    map(as.data.frame) %>%
    map(rowMeans) %>%
    set_names(paste0("mean.", names(.)))
  )

2)申请/申请 分别在每行中分别套用 apply 行执行此操作。

2) apply/tapply Another possibility is to tapply across each row separately. The apply line does this.

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    do.call("cbind", .) %>%
    apply(1, tapply, colnames(.), mean) %>%
    t %>%
    as.data.frame %>%
    setNames(paste0("mean.", names(.)))
  )

3)as.data.frame.table 该方法对大多数操作都使用dplyr和tidyr,但使用 as.data.frame.table 从基础而不是 gather 转换为长格式,以避免添加后缀的问题。

3) as.data.frame.table This approach uses dplyr and tidyr for most operations but uses as.data.frame.table from base instead of gather to convert to long form in order to avoid the probem of adding suffixes.

library(dplyr)
library(magrittr)
library(tidyr)

p %>%
  cbind(as.list(.) %>%
    keep(is.numeric) %>%
    do.call("cbind", .) %>%
    as.data.frame.table %>%
    group_by(Var2, Var1) %>%
    summarize(Mean = mean(Freq)) %>%
    ungroup %>%
    spread(Var2, Mean) %>%
    select(-Var1) %>%
    set_names(paste0("mean.", names(.)))
  )

4)lm 如果 X 是数字列,而均值。是列名,然后 t(coef(lm(t(X)〜Mean。 -1)))给出所需的均值列,因此:

4) lm If X is the numeric columns and mean. is the column names then t(coef(lm(t(X) ~ mean. - 1))) gives the required mean columns so:

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    do.call("cbind", .) %>%
    { lm(t(.) ~ mean. - 1, data.frame(mean. = colnames(.))) } %>%
    coef %>%
    t
  )

这篇关于r tidyverse-计算具有相同名称的多个列的均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆