Melt using patterns when variable names contain string information - avoid coercion to numeric
Problem description
I am using the patterns() argument in data.table::melt() to melt data whose columns follow several easily defined patterns. It works, but I don't see how to create a character index variable instead of the default numeric one.
For example, in data set 'A', the dog and cat column names have numeric suffixes (e.g. 'dog_1', 'cat_2'), which melt handles correctly (see the resulting 'variable' column):
library(data.table)
A = data.table(idcol = c(1:5),
               dog_1 = c(1:5), cat_1 = c(101:105),
               dog_2 = c(6:10), cat_2 = c(106:110),
               dog_3 = c(11:15), cat_3 = c(111:115))
head(melt(A, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
idcol variable dog cat
1: 1 1 1 101
2: 2 1 2 102
3: 3 1 3 103
4: 4 1 4 104
5: 5 1 5 105
6: 1 2 6 106
However, in data set 'B', the suffix of the dog and cat columns is a string (e.g. 'dog_one', 'cat_two'). melt converts such suffixes to a numeric representation, as the 'variable' column shows:
B = data.table(idcol = c(1:5),
dog_one = c(1:5), cat_one = c(101:105),
dog_two = c(6:10), cat_two = c(106:110),
dog_three = c(11:15), cat_three = c(111:115))
head(melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
idcol variable dog cat
1: 1 1 1 101
2: 2 1 2 102
3: 3 1 3 103
4: 4 1 4 104
5: 5 1 5 105
6: 1 2 6 106
How can I fill the 'variable' column with the correct string suffixes one/two/three instead of 1/2/3?
Recommended answer
From data.table 1.14.1 (in development; see the installation instructions), the new function measure makes it much easier to melt data with concatenated variable names into a desired format (see ?measure).
The sep (separator) argument is used to create different groups of measure.vars. In the ... argument, we further specify the fate of the values corresponding to the groups generated by sep.
In the OP, the variable names are of the form species_number, e.g. dog_one. Thus, we need two symbols in ... to specify how the groups before and after the separator should be treated: one for the species (dog or cat) and one for the numbers (one to three).
If a symbol in ... is set to value.name, then "melt returns multiple value columns (with names defined by the unique values in that group)". Because you want one value column per species, and the species is the first group defined by the separator, the first symbol in ... should be value.name.
The second group, after the separator, holds the numbers, so it is specified by the second symbol in ... . We want a single column for the numbers, so here we give the desired name of the output column, e.g. nr.
melt(B, measure.vars = measure(value.name, nr, sep = "_"))
#     idcol    nr dog cat
# 1: 1 one 1 101
# 2: 2 one 2 102
# 3: 3 one 3 103
# 4: 4 one 4 104
# 5: 5 one 5 105
# 6: 1 two 6 106
# 7: 2 two 7 107
# 8: 3 two 8 108
# 9: 4 two 9 109
# 10: 5 two 10 110
# 11: 1 three 11 111
# 12: 2 three 12 112
# 13: 3 three 13 113
# 14: 4 three 14 114
# 15: 5 three 15 115
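Two small variations on the call above may be worth knowing. This is a sketch, assuming a data.table version that provides measure() (1.14.1 development or later, per the answer above). The groups can be described with a regular expression via pattern instead of sep, and for numeric suffixes such as those in data set 'A' a conversion function can be attached to the group name. The object names long_pat and long_int are illustrative only.

```r
library(data.table)  # assumes a version that provides measure()

B <- data.table(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

# Same reshape as above, but the two groups come from regex capture groups.
# Columns that do not match the pattern (here idcol) are left as id columns.
long_pat <- melt(B, measure.vars = measure(value.name, nr,
                                           pattern = "^(dog|cat)_(.*)$"))

A <- data.table(idcol = 1:5,
                dog_1 = 1:5,   cat_1 = 101:105,
                dog_2 = 6:10,  cat_2 = 106:110,
                dog_3 = 11:15, cat_3 = 111:115)

# With numeric suffixes, name = function converts the group to that type.
long_int <- melt(A, measure.vars = measure(value.name, nr = as.integer, sep = "_"))
```

In the first call nr stays a character column holding one/two/three; in the second it becomes an integer column.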
Before data.table 1.14.1
There might be easier ways, but this seems to work:
# grab suffixes of 'variable' names
suff <- unique(sub('^.*_', '', names(B[ , -1])))
# suff <- unique(tstrsplit(names(B[, -1]), "_")[[2]])
# melt
B2 <- melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat"))
# replace factor levels in 'variable' with the suffixes
setattr(B2$variable, "levels", suff)
B2
# idcol variable dog cat
# 1: 1 one 1 101
# 2: 2 one 2 102
# 3: 3 one 3 103
# 4: 4 one 4 104
# 5: 5 one 5 105
# 6: 1 two 6 106
# 7: 2 two 7 107
# 8: 3 two 8 108
# 9: 4 two 9 109
# 10: 5 two 10 110
# 11: 1 three 11 111
# 12: 2 three 12 112
# 13: 3 three 13 113
# 14: 4 three 14 114
# 15: 5 three 15 115
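The approach above relabels the factor levels in place, so 'variable' remains a factor. If a plain character column is preferred, a minimal sketch is to use the factor's integer codes as an index into the suffix vector. It assumes, as in the example data, that the suffixes appear in the same order for the dog and cat columns.

```r
library(data.table)

B <- data.table(idcol = 1:5,
                dog_one = 1:5,     cat_one = 101:105,
                dog_two = 6:10,    cat_two = 106:110,
                dog_three = 11:15, cat_three = 111:115)

# Suffixes in column order (assumes the same order for every prefix).
suff <- unique(sub("^.*_", "", names(B)[-1]))

B2 <- melt(B, measure.vars = patterns("^dog", "^cat"),
           value.name = c("dog", "cat"))

# 'variable' holds the group index; look up the matching suffix and
# overwrite the column with the resulting character vector.
B2[, variable := suff[as.integer(variable)]]
```

Unlike setattr() on the levels, this changes the column type, which may be convenient for later joins or filtering on the suffix.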
Related data.table issues:
melt.data.table should offer variable to match on the name, rather than the number
This is one of the (rare) instances where I believe good ol' base::reshape is cleaner. Its sep argument comes in handy here: both the names of the 'value' columns and the levels of the 'variable' column are generated in one go:
reshape(data = B,
varying = names(B[ , -1]),
sep = "_",
direction = "long")
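For a version that runs with base R alone, here is a sketch of the same call on a plain data.frame copy of the example data. The idvar and timevar arguments are additions not present in the answer: idvar reuses the existing idcol, and timevar gives the suffix column a name (nr here is just an illustrative choice; the default would be time).

```r
# Plain data.frame copy of the example data, so only base R is needed.
B_df <- data.frame(idcol = 1:5,
                   dog_one = 1:5,     cat_one = 101:105,
                   dog_two = 6:10,    cat_two = 106:110,
                   dog_three = 11:15, cat_three = 111:115)

long <- reshape(data = B_df,
                varying = names(B_df)[-1],  # all columns except idcol
                sep = "_",                  # split names into value name and suffix
                idvar = "idcol",            # reuse the existing id column
                timevar = "nr",             # illustrative name for the suffix column
                direction = "long")
```

Because the suffixes cannot be converted to numbers, reshape keeps them as character values in the nr column.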