r tidyverse-计算具有相同名称的多个列的均值 [英] r tidyverse - calculate mean across multiple columns with same name
问题描述
我有一些每周收集的数据,其摘要如下,通过 dput
:
I have some data that is collected weekly, a snippet of which is like so, via dput
:
p <- structure(list(railroad = structure(c(2L, 2L, 2L, 3L, 3L, 3L), .Label =
c("All Other Railroads",
"BNSF Railway Company", "CN", "CSX Transportation", "Norfolk Southern",
"The Kansas City Southern Railway and Kansas City Southern de Mexico, S.A. de
C.V. Consolidated ",
"Union Pacific Railroad"), class = "factor"), measure = structure(c(1L,
4L, 3L, 1L, 4L, 3L), .Label = c("Cars On Line - By Car Owner",
"Cars On Line - By Car Type", "Terminal Dwell (Hours)", "Train Speed (MPH)"
), class = "factor"), category = structure(c(76L, 35L, 4L, 76L,
35L, 29L), .Label = c("All Trains", "Allentown, PA", "Baltimore, MD",
"Barstow, CA", "Bellevue, OH", "Birmingham, AL", "Box", "Buffalo, NY",
"Chattanooga, TN", "Chicago (Proviso), IL", "Chicago, IL", "Cincinnati, OH",
"Coal Unit", "Columbus, OH", "Conway, PA", "Corbin, KY", "Covered Hopper",
"Decatur, IL", "Denver, CO", "Elkhart, IN", "Entire Railroad",
"Fond du Lac Yard, WI", "Foreign RR", "Fort Worth, TX", "Galesburg, IL",
"Gondola", "Grain Unit", "Hamlet, NC", "Harrison Yard (Memphis), TN",
"Hinkle, OR", "Houston (Englewood), TX", "Houston (Settegast), TX",
"Houston, TX", "Indianapolis, IN", "Intermodal", "Jackson Yard, MS",
"Jackson, MS", "Kansas City, KS", "Kansas City, MO", "Knoxville, TN",
"Laredo, TX", "Lincoln, NE", "Linwood, NC", "Livonia, LA", "Louisville, KY",
"MacMillan Yard (Toronto), ON", "Macon, GA", "Manifest", "Markham Yard, IL",
"Memphis, TN", "Monterrey, NL", "Montgomery, AL", "Multilevel",
"Nashville, TN", "New Orleans, LA", "North Little Rock, AR",
"North Platte East, NE", "North Platte West, NE", "Northtown, MN",
"Nuevo Laredo, TM", "Open Hopper", "Other", "Pasco, WA", "Pct. Private",
"Pine Bluff, AR", "Private", "Roanoke, VA", "Roseville, CA",
"Russell, KY", "San Luis Potosi, SL", "Sanchez, TM", "Selkirk, NY",
"Sheffield, AL", "Shreveport, LA", "Symington Yard (Winnipeg), MB",
"System", "Tank", "Tascherau Yard (Montreal), QC", "Thornton Yard (Vancouver),
BC",
"Toledo, OH", "Total", "Tulsa, OK", "Walker Yard (Edmonton), AB",
"Waycross, GA", "West Colton, CA", "Willard, OH"), class = "factor"),
`201510` = c(66923, 33.9, 39.3, 40227, 30.8, 17.5), `201510` = c(66637,
32.6, 56.6, 40778, 30.9, 18.3), `201510` = c(66309, 33.4,
44.9, 40407, 30.5, 17.3), `201511` = c(65980, 34.6, 37.5,
40316, 30.6, 17.5), `201511` = c(67034, 34.6, 43.1, 40174,
30.4, 18.7)), row.names = c(1L, 15L, 21L, 33L, 47L, 53L), class =
"data.frame")
共有143列,第4-143列是数字,我想计算所有具有相同c的列的均值列名。因此,下面的列 201510
重复了3次,列 201511
重复了两次。期望的输出是重复的每列的平均值。例如, 201510
将具有以下值:
There are 143 columns total, and columns 4 - 143 are numeric. I would like to calculate the mean for all columns that have the same column name. So below there is column 201510
repeated 3 times and column 201511
repeated twice. The desired output is the mean of each column repeated. For example, 201510
will have the following values:
`201510`
[1] 66623.00000 33.30000 46.93333 40470.66667 30.73333 17.70000
我尝试了以下代码:
library(tidyverse)
p = data.frame(p)
p %>%
gather(time,value,railroad, measure, category) %>%
mutate(time = gsub('X([^.]+)|.', '\\1', time)) %>%
group_by(time, value, railroad, measure, category) %>%
summarise(MEAN = mean(value)) %>%
ungroup() %>%
spread(time, MEAN)
产生以下错误:
`Error in grouped_df_impl(data, unname(vars), drop) :
Column `railroad` is unknown
In addition: Warning message:
attributes are not identical across measure variables;
they will be dropped `
有办法吗?
推荐答案
这里的主要问题是列名不唯一。 tidyverse大多假定唯一的列名,并且许多函数添加后缀以使它们唯一(如果还没有的话),许多基本函数也是如此,因此在以下所有解决方案中,我们都避免使用任何此类函数。我们仍然可以使用magrittr,purrr,某些基本函数仍然允许这样做。
The main problem here is the non-unique column names. The tidyverse mostly assumes unique column names and many functions add suffixes to make them unique if they were not already, as do many base functions, so in all the solutions below we simply avoid using any such function. We can still use magrittr, purrr and certain base functions still allow this too.
(1),(2)和(4)仅使用magrittr。 (1a)使用purrr,在(3)中,我们使用tidyr和dplyr,但仅在转换为长格式之后。
(1), (2) and (4) only use magrittr. (1a) uses purrr and in (3) we use tidyr and dplyr but only after converting to long form.
所有解决方案均会附加名称为平均值。*
表示数字列中的每个唯一名称。在问题的示例中,数字列之间有两个唯一的名称,因此在该示例中,它追加了两列,分别命名为 mean.201510
和 mean.201511
如下所示。我们仅在(1)中显示输出,但其余输出相似。
All solutions append a column whose name is of the form mean.*
for each unique name among the numeric columns. In the example in the question there are two unique names among the numeric columns so for that example it appends two columns and they are named mean.201510
and mean.201511
as shown below. We only show the output in (1) but the output for the rest is similar.
所有解决方案都使用两个管道。第一个由第一个%>%
组成,第二个pipleline作为 cbind
的自变量出现新列。
All solutions use two pipelines. The first consists of the first %>%
and the second pipleline appears as an argument to cbind
and is what creates the new columns.
(1),(1a)和(4)最短。
(1), (1a) and (4) are tied for the shortest.
1)magrittr magrittr
本身似乎没有添加后缀。 cbind
与以下原始数据帧 p
。首先将 p
转换为列列表,提取数字成分,将其拆分为列名,将每个成分转换为数据框,并获取每个的rowMeans,最后进行设置*。
1) magrittr magrittr
itself does not seem to add suffixes. cbind
the original data frame p
with the following. First convert p
to a list of columns, extract the numeric components, split that on the column names, convert each component to a data frame and take the rowMeans of each and finally set the names to mean.* .
library(magrittr)
p %>%
cbind(as.list(.) %>%
Filter(is.numeric, .) %>%
split(names(.)) %>%
lapply(as.data.frame) %>%
lapply(rowMeans) %>%
setNames(paste0("mean.", names(.)))
)
给予:
railroad measure category
1 BNSF Railway Company Cars On Line - By Car Owner System
15 BNSF Railway Company Train Speed (MPH) Intermodal
21 BNSF Railway Company Terminal Dwell (Hours) Barstow, CA
33 CN Cars On Line - By Car Owner System
47 CN Train Speed (MPH) Intermodal
53 CN Terminal Dwell (Hours) Harrison Yard (Memphis), TN
201510 201510 201510 201511 201511 mean.201510 mean.201511
1 66923.0 66637.0 66309.0 65980.0 67034.0 66623.00000 66507.0
15 33.9 32.6 33.4 34.6 34.6 33.30000 34.6
21 39.3 56.6 44.9 37.5 43.1 46.93333 40.3
33 40227.0 40778.0 40407.0 40316.0 40174.0 40470.66667 40245.0
47 30.8 30.9 30.5 30.6 30.4 30.73333 30.5
53 17.5 18.3 17.3 17.5 18.7 17.70000 18.1
1a)purrr (可选),我们可以将某些基本函数替换为它们的purrr或magrittr。在其他解决方案中,我们也可以转换为purrr。
1a) purrr Optionally we could replace some of the base functions with their purrr or magrittr equivalent. We could also translate to purrr in the other solutions.
library(magrittr)
library(purrr)
p %>%
cbind(as.list(.) %>%
keep(is.numeric) %>%
split(names(.)) %>%
map(as.data.frame) %>%
map(rowMeans) %>%
set_names(paste0("mean.", names(.)))
)
2)申请/申请 分别在每行中分别套用
。 apply
行执行此操作。
2) apply/tapply Another possibility is to tapply
across each row separately. The apply
line does this.
library(magrittr)
p %>%
cbind(as.list(.) %>%
Filter(is.numeric, .) %>%
do.call("cbind", .) %>%
apply(1, tapply, colnames(.), mean) %>%
t %>%
as.data.frame %>%
setNames(paste0("mean.", names(.)))
)
3)as.data.frame.table 该方法对大多数操作都使用dplyr和tidyr,但使用 as.data.frame.table
从基础而不是 gather
转换为长格式,以避免添加后缀的问题。
3) as.data.frame.table This approach uses dplyr and tidyr for most operations but uses as.data.frame.table
from base instead of gather
to convert to long form in order to avoid the probem of adding suffixes.
library(dplyr)
library(magrittr)
library(tidyr)
p %>%
cbind(as.list(.) %>%
keep(is.numeric) %>%
do.call("cbind", .) %>%
as.data.frame.table %>%
group_by(Var2, Var1) %>%
summarize(Mean = mean(Freq)) %>%
ungroup %>%
spread(Var2, Mean) %>%
select(-Var1) %>%
set_names(paste0("mean.", names(.)))
)
4)lm 如果 X
是数字列,而均值。
是列名,然后 t(coef(lm(t(X)〜Mean。 -1)))
给出所需的均值列,因此:
4) lm If X
is the numeric columns and mean.
is the column names then t(coef(lm(t(X) ~ mean. - 1)))
gives the required mean columns so:
library(magrittr)
p %>%
cbind(as.list(.) %>%
Filter(is.numeric, .) %>%
do.call("cbind", .) %>%
{ lm(t(.) ~ mean. - 1, data.frame(mean. = colnames(.))) } %>%
coef %>%
t
)
这篇关于r tidyverse-计算具有相同名称的多个列的均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!