Grouping factor levels in a data.table

Views: 107  Published: 2017/3/12 10:22:11  Tags: r, data.table


Problem description

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.

Example:
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.

Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]  # letters is a character vector, so index it with [ ]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?

For reference, I timed a beefed up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is slowest (having to run all those logic checks is costly), the second the fastest, and the third a close second. But I like the readability of that approach better.

(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)
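
One caveat with the two `levels(DT$grp) <- ...` variants above: they map levels by position, so they rely on all eight underlying levels actually being present in `ind`. With a random sample that is likely but not guaranteed, and a missing level would silently shift the mapping. A minimal sketch of a name-based alternative, using base R only (the `lookup` vector and the toy `ind` values are illustrative assumptions, not part of the original data):

```r
# Toy factor in which levels 6 and 7 never occur (assumed data for illustration).
ind <- factor(c(1, 2, 4, 5, 8, 8, 3))

# Hypothetical lookup: names are the underlying levels, values the aggregated groups.
lookup <- c("1" = "A", "2" = "B", "3" = "A", "4" = "B",
            "5" = "C", "6" = "C", "7" = "C", "8" = "A")

# Copy, then relabel by level name rather than by position, so absent
# levels cannot shift the mapping.
grp <- ind
levels(grp) <- unname(lookup[levels(ind)])

print(data.frame(ind = ind, grp = grp))
```

Because the new labels are looked up from `levels(ind)`, the vector passed to `levels<-` always has exactly one entry per existing level.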
Solution
Update:

I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence table, etc. necessary, just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
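
As a small self-contained sketch of the named-list form, here it is applied to a copy of the factor so that the original `ind` values are kept alongside the new grouping. It uses a plain base R factor; the seed and the explicit `levels = 1:8` are assumptions made only so the example is reproducible and every level exists. The same assignment should work on a factor column of a data.table.

```r
# Toy factor with the eight underlying levels from the question.
set.seed(1)  # assumption: fixed seed only for reproducibility
ind <- factor(sample(8, 20, replace = TRUE), levels = 1:8)

# Copy first so ind is preserved, then merge levels with a named list:
# each name is a new level, each value the set of old levels it absorbs.
grp <- ind
levels(grp) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

print(levels(grp))
print(table(ind, grp))
```

One thing to watch: as far as I can tell from the implementation of `levels<-` for factors, any old level that is left out of the list ends up as `NA`, so the list should cover every level you want to keep.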




Original Answer:

As suggested by @Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
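
For completeness, a hedged variant of the correspondence-table idea: instead of keying both tables and rebuilding `DT` from the join, an update join can add `grp` to `DT` by reference with `on =` and `:=`. This is a sketch under assumptions (eight levels rather than twelve, a fixed seed, explicit `levels = 1:8`), and I have not timed it against the approaches above.

```r
library(data.table)

set.seed(42)  # assumption: seed chosen only for reproducibility
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Correspondence table: one row per underlying level.
match_dt <- data.table(
  ind = factor(1:8, levels = 1:8),
  grp = factor(c("A", "B", "A", "B", "C", "C", "C", "A"))
)

# Update join: look up each row's ind in match_dt and assign the matching
# grp into DT by reference; DT keeps its original row order.
DT[match_dt, on = "ind", grp := i.grp]

print(head(DT))
```

Unlike `DT = match_dt[DT]` after `setkey`, this leaves the rows of `DT` in their original order and does not require keys on either table.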


                        