子集数据表的最快方法是什么? [英] What's the fastest way to subset a data.table?

查看:84
本文介绍了子集数据表的最快方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我看来,处理 data.table 的行/列子集的最快方法是使用join和 nomatch 选项。

It seems to me the fastest way to do a row/col subset of a data.table is to use the join and nomatch option.

DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

基于二进制搜索的快速连接也有一个问题:我找不到一种方法来选择一维中的所有项目。

I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.

请问是否要随后执行此操作:

Say if I want to subsequently do:

DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

而不必将密钥重新设置为仅一维(因为我处于循环状态,并且我不想每次都放下密钥)。 / p>

without having to re-set the key to only one dimension (because I am in a loop and I don't want to rest the keys every time).

推荐答案

子集 data.table 的最快方法是什么?

使用基于二进制搜索的子集功能最快。请注意,该子集需要选项 nomatch = 0L ,以便仅返回匹配结果。

What's the fastest way to subset a data.table?

Using the binary search based subset feature is the fastest. Note that the subset requires the option nomatch = 0L so as to return only the matching results.

如果您在 DT 上设置了两个键,并且想要通过第一个键 子集,那么您只需在 J(。)中提供第一个值,否需要提供第二个密钥的任何内容。也就是说:

If you've two keys set on DT and you want to subset by the first key, then you can just provide the first value in J(.), no need to provide anything for the 2nd key. That is:

# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L] 

如果相反,您想通过第二个键进行子集,那么到目前为止,您必须为第一个键提供所有唯一值。也就是说:

If instead, you would like to subset by the second key, then you'll have to, as of now, provide all the unique values for the first key. That is:

# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

这也显示为在此SO帖子中。尽管我希望 DT [J(,2)] 可以用于这种情况,因为这似乎很直观。

This is also shown in this SO post. Although I'd prefer that DT[J(, 2)] to work for this case, as that seems rather intuitive.

还有一个待处理的功能请求, FR#1007 用于实现辅助密钥,完成后将解决此问题。

There's also a pending feature request, FR #1007 for implementing secondary keys, which when done would take care of this.

下面是一个更好的示例:

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

总结:

# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]

这篇关于子集数据表的最快方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆