With data.table, is .SD[which.max(Var1)] the fastest way to find the max of a group?

Question

If needed I can put together a dataset, but my question is somewhat general.

accts <- accts[, .SD[which.max(EE)], by=DnB.Name]

I've got a data.table of about 350k rows. Some of the DnB.Name values (Dun & Bradstreet company name) are duplicates with differing employee counts (EE); I only care about the highest count for each name and can disregard the rest.

Anyway, DT is usually lightning quick, so I figure I must be doing something wrong?

Solution

Sort by EE in descending order, then take the first row of each group using a self join:

ordered <- accts[order(-EE)]                    # descending order by EE
setkey(ordered, DnB.Name)                       # must setkey before the join
ordered[J(unique(DnB.Name)), mult = "first"]    # first (largest-EE) row per name
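To make the self join concrete, here is a minimal sketch on a small made-up table. The company names and counts are hypothetical and only stand in for the real accts data; the three data.table calls are the same ones used above.

```r
library(data.table)

# Hypothetical toy data standing in for accts (values are invented)
accts <- data.table(
  DnB.Name = c("Acme", "Acme", "Globex", "Globex", "Initech"),
  EE       = c(10L, 250L, 40L, 5L, 7L)
)

ordered <- accts[order(-EE)]   # largest EE first
setkey(ordered, DnB.Name)      # sort by name; rows within a name keep the EE order
top <- ordered[J(unique(DnB.Name)), mult = "first"]  # first match per name
print(top)
```

Because the table is sorted by descending EE before the key is set, the first matching row for each name should be the one with the largest employee count.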

For reference, see this post on Cross Validated: http://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier

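EDIT: Even faster, though the syntax is less obvious. .I holds the row numbers of the original table, so the inner call returns one row index per group (in column V1), and that index vector is then used to subset accts: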

accts[accts[, .I[which.max(EE)], by = DnB.Name]$V1]
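The following sketch checks that the .I form selects the same rows as the original .SD form and times both. The data are randomly generated for illustration (names, sizes and the seed are assumptions, not the real accts table), and timings will vary by machine and data.table version, so no numbers are quoted here.

```r
library(data.table)

# Hypothetical synthetic data: many duplicate names with varying EE
set.seed(1)
n <- 20000L
accts <- data.table(
  DnB.Name = sample(sprintf("co%04d", 1:5000), n, replace = TRUE),
  EE       = sample.int(5000L, n, replace = TRUE)
)

# Original approach: subset .SD inside every group
t_SD <- system.time(
  via_SD <- accts[, .SD[which.max(EE)], by = DnB.Name]
)

# .I approach: collect one row number per group, then subset once
t_I <- system.time({
  idx   <- accts[, .I[which.max(EE)], by = DnB.Name]$V1
  via_I <- accts[idx]
})

# Both use which.max, so they should pick the same row in each group
setkey(via_SD, DnB.Name)
setkey(via_I, DnB.Name)
stopifnot(isTRUE(all.equal(via_SD, via_I)))

print(rbind(SD = t_SD, I = t_I))
```

The .I version only computes an integer vector inside each group and does a single subset at the end, which is the likely reason it avoids most of the per-group overhead of building .SD.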

For reference, check this post with a similar question: Subset by group with data.table
