How to get a quick summary of counts in data.table

Problem description

This is part of a feature-engineering step that summarizes each ID according to a column called Col. The same preprocessing will be applied to the test set. Since the data set is large, a data.table-based solution is preferred.

Training input:

ID   Col
A    M
A    M
A    M
B    K
B    M

Expected output for the above training input:

ID   Col_M  Col_K
A    3      0      # A has 3 M in Col and 0 K in Col
B    1      1  

The above is for processing the training data. The test data set must be mapped onto the same Col_M and Col_K columns, meaning that any other value appearing in Col, such as S, is ignored.

Test input:

ID   Col 
C    M
C    S

Expected output for the above test input:

ID   Col_M  Col_K
C    1      0      # C has 1 M in Col and 0 K in Col. The S value is ignored


Recommended answer

One possible data.table implementation is to first filter Col by c("M", "K"), then add these values as factor levels (in case some of them are absent, as in your second case), and then run dcast with drop = FALSE so that unused levels are kept, fill = 0L (for the cases where one of the desired levels is missing), and fun = length (in order to count). Here fun is an abbreviation that R partially matches to dcast's fun.aggregate argument.

Testing on both data sets:

library(data.table)

### First example
df <- fread("ID   Col
A    M
A    M
A    M
B    K
B    M")

dcast(df[Col %in% c("M", "K")], # Work only with c("M", "K")
      ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), # Add missing levels
      drop = FALSE, # Keep missing levels in output
      fill = 0L, # Fill missing values with zeroes instead of NAs
      fun = length) # Count rows per ID/level pair; 'value.var' can also be specified

#    ID M K
# 1:  A 3 0
# 2:  B 1 1

### Second example
df <- fread("ID   Col 
C    M
C    S")

dcast(df[Col %in% c("M", "K")], 
  ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), 
  drop = FALSE,
  fill = 0L,
  fun = length)

#    ID M K
# 1:  C 1 0
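
Since the same preprocessing has to run on both the training and the test set, the two calls above could be wrapped in one function. The following is only a sketch under a few assumptions: the helper name `count_by_id`, its `keep` and `prefix` arguments, and the renaming to Col_M / Col_K via `setnames` are illustrative and not part of the original answer.

```r
library(data.table)

# Hypothetical helper (not from the original answer): count, per ID,
# how often each value in `keep` occurs in Col.
count_by_id <- function(dt, keep, prefix = "Col_") {
  # Subsetting with i returns a new table, so `dt` itself is not modified
  kept <- dt[Col %in% keep]
  # Fix the factor levels so that every value in `keep` gets a column
  kept[, Col := factor(Col, levels = keep)]
  out <- dcast(kept, ID ~ Col, fun.aggregate = length,
               value.var = "Col", drop = FALSE, fill = 0L)
  # Rename the count columns, e.g. M and K become Col_M and Col_K
  setnames(out, keep, paste0(prefix, keep))
  out[]
}

train <- data.table(ID = c("A", "A", "A", "B", "B"),
                    Col = c("M", "M", "M", "K", "M"))
test  <- data.table(ID = c("C", "C"), Col = c("M", "S"))

keep <- unique(train$Col)   # values seen in training: "M", "K"
count_by_id(train, keep)
count_by_id(test, keep)
```

With the sample data this should reproduce the expected tables from the question, with the columns named Col_M and Col_K. Note that an ID whose rows all carry ignored values disappears from the result, because those rows are filtered out before casting; the sketch also does not handle a table in which no row matches `keep`.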
