Grouping factor levels in a data.table
Question
I'm trying to combine factor levels in a data.table and wondering if there's a data.table-y way to do so.
Example:
library(data.table)
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1, 3 and 8 are in group A; 2 and 4 are in group B; and 5, 6 and 7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater):
DT[ , grp := ind]
lev <- letters[1:8]  # letters is a character vector, so index it with [ ]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed-up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is the slowest (having to run all those logical checks is costly), the second is the fastest, and the third is a close second. But I like the readability of the third approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods.)
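For anyone who wants to reproduce the comparison, here is a rough timing sketch of the three approaches. The table size `n`, the seed, and the column names `grp1`, `grp2`, `grp3` are my own placeholders, and the first approach assigns character values and converts to a factor at the end, which is a small variation on the code above. Fixing `levels = 1:8` when building `ind` is also an addition: the positional relabelling assumes all eight levels are present. Timings will vary by machine, so none are quoted here.

```r
library(data.table)

set.seed(42)   # assumption: fixed seed for reproducibility
n  <- 1e5      # hypothetical size; raise it to see clearer differences
DT <- data.table(id  = seq_len(n),
                 ind = factor(sample(8, n, replace = TRUE), levels = 1:8))

# Approach 1: one logical subset per aggregated group
t1 <- system.time({
  DT[ind %in% c(1, 3, 8), grp1 := "A"]
  DT[ind %in% c(2, 4),    grp1 := "B"]
  DT[ind %in% c(5, 6, 7), grp1 := "C"]
  DT[, grp1 := factor(grp1)]
})

# Approach 2: relabel the levels positionally in one go
t2 <- system.time({
  DT[, grp2 := {
    g <- ind
    levels(g) <- c("A", "B", "A", "B", "C", "C", "C", "A")
    g
  }]
})

# Approach 3: build the label vector step by step, then relabel
t3 <- system.time({
  lev <- letters[1:8]
  lev[c(1, 3, 8)] <- "A"
  lev[c(2, 4)]    <- "B"
  lev[5:7]        <- "C"
  DT[, grp3 := {
    g <- ind
    levels(g) <- lev
    g
  }]
})

# All three approaches should produce the same grouping
stopifnot(identical(as.character(DT$grp1), as.character(DT$grp2)),
          identical(DT$grp2, DT$grp3))

# Elapsed seconds per approach
print(c(subset   = t1[["elapsed"]],
        relabel  = t2[["elapsed"]],
        stepwise = t3[["elapsed"]]))
```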
Recommended answer
Update:
I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence tables, etc. are necessary; just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
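Note that the line above overwrites ind in place. A minimal sketch of the same idea that keeps ind and writes the result to a new grp column, as in the question, might look like this. The seed, the explicit `levels = 1:8`, and the helper name `new_grp` are my additions for illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
DT <- data.table(id  = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Work on a copy of the column so that ind keeps its original levels
new_grp <- DT$ind
levels(new_grp) <- list(A = c("1", "3", "8"),
                        B = c("2", "4"),
                        C = c("5", "6", "7"))
DT[, grp := new_grp]

# Cross-tabulate to confirm each original level lands in exactly one group
DT[, table(ind, grp)]
```

The old levels are written as character strings here to match how a factor stores them; the numeric form used above also works because matching coerces both sides to a common type.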
Original Answer:
As suggested by @Arun, we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
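A variation on the join, sketched here under the assumption that you would rather add grp to DT by reference than rebuild DT, is an update join with `on =`. It needs no keys and leaves the row order of DT untouched. The seed and the eight-level lookup table are illustrative choices, not part of the original answer.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
DT <- data.table(id  = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Correspondence table: one row per original level
match_dt <- data.table(ind = factor(1:8),
                       grp = factor(c("A", "B", "A", "B", "C", "C", "C", "A")))

# Update join: look up each row's ind in match_dt and add grp by reference.
# The i. prefix refers to the column coming from match_dt.
DT[match_dt, on = "ind", grp := i.grp]
```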
We can also do this in (what I consider to be) a more readable fashion, at a marginal cost in speed:
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
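With many groups, the correspondence table can also be built from a single named list instead of a series of index assignments. This is only a sketch: the `groups` list is a hypothetical name that restates the mapping used above.

```r
library(data.table)

# Hypothetical mapping: aggregated group name -> original levels
groups <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7,
               D = c(9, 12), E = 10, F = 11)

# One row per original level; each group name is repeated once per member
match_dt <- data.table(
  ind = factor(unlist(groups, use.names = FALSE), levels = 1:12),
  grp = factor(rep(names(groups), lengths(groups)))
)
setkey(match_dt, ind)
```

The resulting match_dt can then be joined to DT in the same way as before.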