Melt using patterns when variable names contain string information - avoid coercion to numeric


Question

I am using patterns() in the measure.vars argument of data.table::melt() to melt data whose columns follow several easily defined patterns. It works, but I can't see how to get a character index variable instead of the default numeric one.

For example, in data set 'A', the dog and cat column names have numeric suffixes (e.g. 'dog_1', 'cat_2'), which melt handles correctly (see the resulting 'variable' column):

A = data.table(idcol = c(1:5),
            dog_1 = c(1:5),   cat_1 = c(101:105),
            dog_2 = c(6:10),  cat_2 = c(106:110),
            dog_3 = c(11:15), cat_3 = c(111:115))  

head(melt(A, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
   
   idcol variable dog cat
1:     1        1   1 101
2:     2        1   2 102
3:     3        1   3 103
4:     4        1   4 104
5:     5        1   5 105
6:     1        2   6 106

However, in data set 'B', the suffix of the dog and cat columns is a string (e.g. 'dog_one', 'cat_two'). melt replaces such suffixes with a numeric index; see the 'variable' column.

B = data.table(idcol = c(1:5),
                dog_one = c(1:5),     cat_one = c(101:105),
                dog_two = c(6:10),    cat_two = c(106:110),
                dog_three = c(11:15), cat_three = c(111:115))

head(melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))

   idcol variable dog cat
1:     1        1   1 101
2:     2        1   2 102
3:     3        1   3 103
4:     4        1   4 104
5:     5        1   5 105
6:     1        2   6 106
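
A minimal sketch of why the suffixes disappear, assuming the data set 'B' defined above: with patterns(), the 'variable' column is a factor whose levels are the positions of the matched columns within each pattern, so the original column names never enter the result. The name `long` is only illustrative.

```r
library(data.table)

B <- data.table(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

long <- melt(B, measure.vars = patterns("^dog", "^cat"),
             value.name = c("dog", "cat"))

# 'variable' is a factor, not a number; its levels are group positions
class(long$variable)
levels(long$variable)
```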

How can I fill the 'variable' column with the correct string suffixes one/two/three instead of 1/2/3?

Recommended answer

From data.table 1.14.1 (in development; see installation), the new function measure makes it much easier to melt data with concatenated variable names into a desired format (see ?measure).

The separator argument sep is used to split the column names into groups of measure.vars. In the ... argument, we then specify what should happen to the values in each group generated by sep.

In the OP, the variable names are of the form species_number, e.g. dog_one. Thus, we need two symbols in ... to specify how the groups before and after the separator should be treated: one for the species (dog or cat) and one for the numbers (one to three).

If a symbol in ... is set to value.name, then "melt returns multiple value columns (with names defined by the unique values in that group)". Because you want one value column per species, and species is the first group defined by the separator, the first symbol in ... should be value.name.

The second group, after the separator, holds the numbers, so it is specified as the second symbol in .... We want a single column for the numbers, so here we give the desired name of the output column, e.g. "nr".

melt(B, measure.vars = measure(value.name, nr, sep = "_"))

#     idcol    nr dog cat
#  1:     1   one   1 101
#  2:     2   one   2 102
#  3:     3   one   3 103
#  4:     4   one   4 104
#  5:     5   one   5 105
#  6:     1   two   6 106
#  7:     2   two   7 107
#  8:     3   two   8 108
#  9:     4   two   9 109
# 10:     5   two  10 110
# 11:     1 three  11 111
# 12:     2 three  12 112
# 13:     3 three  13 113
# 14:     4 three  14 114
# 15:     5 three  15 115
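
As a hedged variation on the call above: measure() also accepts a pattern argument, a regular expression with one capture group per symbol in ..., which can be used in place of sep when the names are not cleanly split by a single separator. This sketch assumes a data.table version that provides measure(); the regular expression and the name `long` are illustrative choices, not part of the original answer.

```r
library(data.table)

B <- data.table(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

# First capture group -> value column names (dog, cat),
# second capture group -> values of the new 'nr' column
long <- melt(B, measure.vars = measure(value.name, nr,
                                       pattern = "^(dog|cat)_(.*)$"))

# 'nr' holds the string suffixes rather than a numeric index
unique(long$nr)
```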


data.table < 1.14.1

There might be easier ways, but this seems to work:

# grab suffixes of 'variable' names
suff <- unique(sub('^.*_', '', names(B[ , -1])))
# suff <- unique(tstrsplit(names(B[, -1]), "_")[[2]])

# melt
B2 <- melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat"))
   
# replace factor levels in 'variable' with the suffixes
setattr(B2$variable, "levels", suff)

B2
#     idcol variable dog cat
# 1:      1      one   1 101
# 2:      2      one   2 102
# 3:      3      one   3 103
# 4:      4      one   4 104
# 5:      5      one   5 105
# 6:      1      two   6 106
# 7:      2      two   7 107
# 8:      3      two   8 108
# 9:      4      two   9 109
# 10:     5      two  10 110
# 11:     1    three  11 111
# 12:     2    three  12 112
# 13:     3    three  13 113
# 14:     4    three  14 114
# 15:     5    three  15 115
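
The approach above leaves 'variable' as a factor with relabelled levels. If a plain character column is wanted, as the question asks, one possible sketch is to index the suffix vector by the factor's integer codes. This assumes, like the code above, that the dog and cat columns appear in the same suffix order in B.

```r
library(data.table)

B <- data.table(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

# suffixes in column order: "one", "two", "three"
suff <- unique(sub("^.*_", "", names(B)[-1]))

B2 <- melt(B, measure.vars = patterns("^dog", "^cat"),
           value.name = c("dog", "cat"))

# map the factor codes 1/2/3 to the suffixes, giving a character column
B2[, variable := suff[as.integer(variable)]]
```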

Two related data.table issues:

melt.data.table should offer variable to match on the name, rather than the number

FR: extend melt functionality for handling output names.

This is one of the (rare) instances where I believe good ol' base::reshape is cleaner. Its sep argument comes in handy here: both the names of the 'value' columns and the levels of the 'variable' column are generated in one go:

reshape(data = B,
        varying = names(B[ , -1]),
        sep = "_",
        direction = "long")
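
A self-contained sketch of the same call on a plain data.frame, in case it helps to see the pieces named. The timevar and idvar arguments are additions here (they are not in the call above): timevar names the column that receives the suffixes, and idvar reuses idcol instead of letting reshape create a new id column.

```r
B <- data.frame(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

long <- reshape(data = B,
                varying = names(B)[-1],  # all columns except idcol
                sep = "_",               # split names into value name and suffix
                timevar = "nr",          # column holding the suffixes
                idvar = "idcol",
                direction = "long")

# the suffixes cannot be converted to numbers, so 'nr' stays character
str(long$nr)
```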
