按升序/降序快速排序 data.table [英] Sort a data.table fast by Ascending/Descending order

查看:25
本文介绍了按升序/降序快速排序 data.table的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大约 300 万行和 40 列的 data.table.我想按组内的降序对该表进行排序,如下面的 sql 模拟代码:

I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:

sort by ascending Year, ascending MemberID, descending Month 

data.table 中是否有等效的方法来执行此操作?到目前为止,我必须将其分解为 2 个步骤:

Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:

setkey(X, Year, MemberID)

这非常快,只需几秒钟.

This is very fast and takes only a few second.

X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]

这一步需要更长的时间(5 分钟).

This step takes so much longer (5 minutes).

更新:有人发表评论做 X <- X[sort(Year, MemberID, -Month)] 后删除.这种方法似乎要快得多:

Update: Someone made a comment to do X <- X[sort(Year, MemberID, -Month)] and later deleted. This approach seems to be much faster:

user  system elapsed 
5.560  11.242  66.236 

我的方法:setkey() 然后 order(-Month)

My approach: setkey() then order(-Month)

   user  system elapsed 
816.144   9.648 848.798 

我现在的问题是:如果我想在排序后(Year, MemberID, Month)按Year,MemberId和Month进行汇总,data.table是否识别排序顺序?

My question is now: if I want to summarize by Year, MemberId and Month after sort(Year, MemberID, Month), does data.table recognize the sort order?

更新 2:回复 Matthew Dowle:

Update 2: to response to Matthew Dowle:

在使用 Year、MemberID 和 Month 设置键后,我仍然每个组有多个记录.我想为每个组进行总结.我的意思是:如果我使用 X[order(Year, MemberID, Month)],求和是否利用 data.table 的二进制搜索功能:

After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize for each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation utilizes binary search functionality of data.table:

monthly.X <- X[, lapply(.SD[], sum), by = list(Year, MemberID, Month)]

更新 3:Matthew D 提出了几种方法.第一种方法的运行时间比 order() 方法快:

Update 3: Matthew D proposed several approaches. Run time for the first approach is faster than order() approach:

   user  system elapsed 
  7.910   7.750  53.916 

马修:让我感到惊讶的是转换月份的符号需要大部分时间.没有它,setkey 速度很快.

Matthew: what surprised me was converting the sign of Month takes most of the time. Without it, setkey is blazing fast.

推荐答案

2014 年 6 月 5 日更新:

data.table v1.9.3的当前开发版本实现了两个新功能,分别是:setordersetorderv,正是你需要的.这些函数对 data.table 按引用 重新排序,并可选择按每列的升序或降序进行排序.查看 ?setorder 了解更多信息.

Update June 5 2014:

The current development version of data.table v1.9.3 has two new functions implemented, namely: setorder and setorderv, which does exactly what you require. These functions reorder the data.table by reference with the option to choose either ascending or descending order on each column to order by. Check out ?setorder for more info.

此外,DT[order(.)] 也默认优化为使用 data.tableinternal fast orderbase:::order.与 setorder 不同,这将生成数据的完整副本,因此内存效率较低,但仍比使用 base 的 order 操作快几个数量级.

In addition, DT[order(.)] is also by default optimised to use data.table's internal fast order instead of base:::order. This, unlike setorder, will make an entire copy of the data, and is therefore less memory efficient, but will still be orders of magnitude faster than operating using base's order.

这是使用 setorder、data.table 的内部快速顺序和 base:::order:

Here's an illustration on the speed differences using setorder, data.table's internal fast order and with base:::order:

require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year     = sample(1950:2000, 3e6, TRUE), 
                 memberID = sample(paste0("V", 1:1e4), 3e6, TRUE), 
                 month    = sample(12, 3e6, TRUE))

## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
#   user  system elapsed 
# 76.909   0.262  81.266 

## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
#   user  system elapsed 
#  0.985   0.030   1.027

## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
#   user  system elapsed 
#  0.585   0.013   0.600 

## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))

## are they equal?
identical(ans2, DT)    # [1] TRUE
identical(ans1, ans2)  # [1] TRUE

在此数据上,基准测试表明 data.table 的顺序比 base:::order~79 倍 并且 setorder 是 <比 base:::order 这里强>~135x 快.

On this data, benchmarks indicate that data.table's order is about ~79x faster than base:::order and setorder is ~135x faster than base:::order here.

data.table 总是在 C 语言环境中排序/排序.如果您需要在其他语言环境中订购,那么您才需要使用 DT[base:::order(.)].

data.table always sorts/orders in C-locale. If you should require to order in another locale, only then do you need to resort to using DT[base:::order(.)].

所有这些新的优化和功能共同构成 FR #2405.bit64::integer64 支持也已添加.

All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.

注意:请参阅历史/修订以获取较早的答案和更新.

NOTE: Please refer to the history/revisions for earlier answer and updates.

这篇关于按升序/降序快速排序 data.table的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆