With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group?
If needed I can put together a dataset, but my question is somewhat general.
accts <- accts[, .SD[which.max(EE)], by=DnB.Name]
I've got a DT of about 350k rows, and some of the DnB.Name values (Dun & Bradstreet company names) are duplicated with differing employee counts (EE). I only care about the highest count for each name and can disregard the rest.
Anyway, DT is usually lightning quick, so I figure I must be doing something wrong?
Instead, sort by EE in descending order, then take the first row of each group using a self join:
ordered <- accts[order(-EE)]  # descending order
setkey(ordered, DnB.Name)     # must setkey before the join
ordered[J(unique(DnB.Name)), mult = "first"]
For reference, check out this post on Cross Validated: http://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier
EDIT: even faster, but weird syntax:
accts[accts[, .I[which.max(EE)], by = DnB.Name]$V1]
For reference, check this post with a similar question: Subset by group with data.table
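To make the `.I` idiom less mysterious, here is a minimal sketch on made-up data. The company names and counts are hypothetical stand-ins for the real `accts` table. Inside `j`, `.I` holds the row numbers of the current group in the original table, so `.I[which.max(EE)]` returns the row number of each group's maximum, and the outer `accts[...]` subsets the table once with those numbers. That most likely explains the speed-up: it avoids building and subsetting `.SD` for every group.

```r
library(data.table)

# Hypothetical toy data standing in for the real 350k-row `accts` table.
accts <- data.table(
  DnB.Name = c("Acme", "Acme", "Globex", "Globex", "Initech"),
  EE       = c(10L, 50L, 7L, 3L, 20L)
)

# Original approach: subset .SD inside every group.
by_sd <- accts[, .SD[which.max(EE)], by = DnB.Name]

# .I approach, step 1: one row number per group (returned in column V1).
idx <- accts[, .I[which.max(EE)], by = DnB.Name]$V1

# .I approach, step 2: subset the original table once with those row numbers.
by_i <- accts[idx]

print(by_i)
```

Both approaches keep exactly one row per name, because `which.max` returns only the first maximum when a group has ties.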