整齐的数据框架,重复列名 [英] Tidy data.frame with repeated column names

查看:128
本文介绍了整齐的数据框架,重复列名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个程序给我这种格式的数据

I have a program that gives me data in this format

toy
                file_path Condition Trial.Num A B  C  ID A B  C   ID  A B  C    ID
1     root/some.extension  Baseline         1 2 3  5 car 2 1  7 bike  4 9  0 plane
2    root/thing.extension  Baseline         2 3 6 45 car 5 4  4 bike  9 5  4 plane
3     root/else.extension  Baseline         3 4 4  6 car 7 5  4 bike 68 7 56 plane
4 root/uniquely.extension Treatment         1 5 3  7 car 1 7 37 bike  9 8  7 plane
5  root/defined.extension Treatment         2 6 7  3 car 4 6  8 bike  9 0  8 plane

我的目标是将格式整理成至少可以更轻松地整理具有独特列名称的重新格式的东西

My goal is to tidy the format into something that at least can be easier to finally tidy with reshape having unique column names

tidy_toy
                 file_path Condition Trial.Num  A B  C    ID
1      root/some.extension  Baseline         1  2 3  5   car
2     root/thing.extension  Baseline         2  3 6 45   car
3      root/else.extension  Baseline         3  4 4  6   car
4  root/uniquely.extension Treatment         1  5 3  7   car
5   root/defined.extension Treatment         2  6 7  3   car
6      root/some.extension  Baseline         1  2 1  7  bike
7     root/thing.extension  Baseline         2  5 4  4  bike
8      root/else.extension  Baseline         3  7 5  4  bike
9  root/uniquely.extension Treatment         1  1 7 37  bike
10  root/defined.extension Treatment         2  4 6  8  bike
11     root/some.extension  Baseline         1  4 9  0 plane
12    root/thing.extension  Baseline         2  9 5  4 plane
13     root/else.extension  Baseline         3 68 7 56 plane
14 root/uniquely.extension Treatment         1  9 8  7 plane
15  root/defined.extension Treatment         2  9 0  8 plane

如果我尝试 melt from toy 它不工作,因为onl您的第一个ID列将用于 id.vars (因此所有内容都将被标记为汽车)。相同的变量将被删除。

If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.

这是两个表的dput

Here's the dput of both tables

   structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
    "root/else.extension", "root/some.extension", "root/thing.extension", 
    "root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
    1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"), 
        Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L, 
        4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L, 
        1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L, 
        5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L, 
        4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"), 
        A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L, 
        4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path", 
    "Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C", 
    "ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
    -5L))


structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L, 
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
"root/else.extension", "root/some.extension", "root/thing.extension", 
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline", 
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L, 
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L, 
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L, 
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L, 
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike", 
"car", "plane"), class = "factor")), .Names = c("file_path", 
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
-15L))


推荐答案

您可以使用 make.unique 功能来创建唯一的列名。之后,您可以使用 data.table -package中的 fusion ,这可以基于模式在列名中:

You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A','^B','^C','^ID'), 
                 value.name = c('A','B','C','ID'))

其中:

 > tidy_toy
                  file_path condition Trial.Num variable  A B  C    ID
 1:     root/some.extension  Baseline         1        1  2 3  5   car
 2:    root/thing.extension  Baseline         2        1  3 6 45   car
 3:     root/else.extension  Baseline         3        1  4 4  6   car
 4: root/uniquely.extension Treatment         1        1  5 3  7   car
 5:  root/defined.extension Treatment         2        1  6 7  3   car
 6:     root/some.extension  Baseline         1        2  2 1  7  bike
 7:    root/thing.extension  Baseline         2        2  5 4  4  bike
 8:     root/else.extension  Baseline         3        2  7 5  4  bike
 9: root/uniquely.extension Treatment         1        2  1 7 37  bike
10:  root/defined.extension Treatment         2        2  4 6  8  bike
11:     root/some.extension  Baseline         1        3  4 9  0 plane
12:    root/thing.extension  Baseline         2        3  9 5  4 plane
13:     root/else.extension  Baseline         3        3 68 7 56 plane
14: root/uniquely.extension Treatment         1        3  9 8  7 plane
15:  root/defined.extension Treatment         2        3  9 0  8 plane






一种更复杂的方法,更好地区分为模式参数:

# select the names that are not unique
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms, 
                         rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                             each = length(names(tt)[tt > 1])), 
                         sep = '.')

# your columnnames are now unique:
> names(toy)
 [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
 [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     

# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                 value.name = c('A','B','C','ID'))

这将给出相同的最终结果。

which will give the same end-result.

如评论中所述, janitor -package也可以帮助您解决此问题。 clean_names() make.unique 函数类似。 看到这里的解释。

As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.

这篇关于整齐的数据框架,重复列名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆