Grouping factor levels in a data.table
Question
I'm trying to combine factor levels in a data.table and wondering if there's a data.table-y way to do so.
Example:
library(data.table)
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater):
DT[ , grp := ind]
lev <- letters[1:8]  # placeholder labels, all overwritten below
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed-up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is the slowest (having to run all those logic checks is costly), the second is the fastest, and the third is a close second. But I like the readability of the third approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods.)
Solution
Update:
I recently learned of a much simpler way to re-associate factor levels, from this question and a closer reading of ?levels. No merges, correspondence tables, etc. are necessary; just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
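As a minimal sketch of the named-list approach, the snippet below keeps ind intact and writes the collapsed factor to a separate grp column. The helper name regroup, the seed, and the explicit levels = 1:8 are assumptions added for illustration (pinning the levels guards against sample() happening to miss one of the eight values); none of them come from the original post.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
# Pin the level set so all 8 levels exist even if sample() misses one
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Hypothetical helper: returns a copy of f with its levels collapsed
regroup <- function(f, groups) {
  levels(f) <- groups  # named list: names are new levels, values are old ones
  f
}

groups <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

# Add grp by reference; ind itself is left unchanged
DT[, grp := regroup(ind, groups)]

# Cross-tabulate to confirm each original level landed in the intended group
DT[, table(ind, grp)]
```

Any original level that is not mentioned in the list ends up as NA in the result, so it is worth checking that the list covers every level.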
Original Answer:
As suggested by @Arun, we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12), grp = as.factor(c("A", "B", "A", "B", "C", "C", "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
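A variation on the correspondence-table idea, sketched here as an alternative rather than as the method from the answer: data.table's update join (on = together with := and the i. prefix) adds grp to DT by reference, without keying either table or reassigning DT. The 8-level correspondence table, the seed, and the pinned levels are assumptions chosen to match the question's example.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Correspondence table: one row per original level
match_dt <- data.table(
  ind = factor(1:8),
  grp = factor(c("A", "B", "A", "B", "C", "C", "C", "A"))
)

# Update join: look up each row's ind in match_dt and add grp by reference.
# The i. prefix refers to the column coming from match_dt.
DT[match_dt, on = "ind", grp := i.grp]
```

Unlike DT = match_dt[DT], this leaves the row order and column order of DT as they were.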
We can also do this in what I consider to be a more readable fashion (at a marginal speed cost):
levels <- letters[1:12]  # placeholder labels, all overwritten below
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12), grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
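If the join itself is not needed, the same incrementally built mapping can be applied directly as a named lookup vector. This is a sketch under assumptions: the name lookup, the seed, and the 8-level example data are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Build the mapping step by step, naming each entry by its original level
lookup <- character(8)
names(lookup) <- 1:8
lookup[c(1, 3, 8)] <- "A"
lookup[c(2, 4)]    <- "B"
lookup[5:7]        <- "C"

# Index by level label (not integer code) so the level order of ind
# cannot silently change the result; unname() drops the carried-over names
DT[, grp := factor(unname(lookup[as.character(ind)]))]
```

Indexing by label is a deliberate choice: a purely positional mapping gives wrong groups if the factor's levels are not exactly 1 through 8 in order.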