What is the fastest way to get a vector of sorted unique values from a data.table?
The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:
# 1
sort(salesdt[, unique(company)])
#2
sort(unique(salesdt$company))
#3
salesdt[order(company), unique(company)]
Another answer suggested sort orders other than lexicographical order:
salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]
The data.table was created by
library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit",
"Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company,item,sales)
As always when several options are available, I started to wonder which solution would be the best, in particular if the data.table were much larger. I have searched a bit on SO but haven't found a specific answer so far.
Therefore, I benchmarked the solutions proposed for that question, and I would like to share the results in my answer below.
For benchmarking, a larger data.table is created with 1,000,000 rows:
n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE),
item = sample(item, n, TRUE),
sales = sample(sales, n, TRUE))
For the sake of completeness, the variants
# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])
will be benchmarked, although it seems obvious that sorting the unique values should be faster than the other way around.
In addition, two other sort options from this answer are included:
# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]
Edit: Following from Frank's comment, I've included his suggestion:
# 8
salesdt[,logical(1), keyby = company]$company
Benchmarking, no key set
Benchmarking is done with the help of the microbenchmark package:
timings <- microbenchmark::microbenchmark(
sort(salesdt[, unique(company)]),
sort(unique(salesdt$company)),
salesdt[order(company), unique(company)],
unique(sort(salesdt$company)),
unique(salesdt[,sort(company)]),
salesdt[, .N, by = company][order(-N), company],
salesdt[, sum(sales), by = company][order(-V1), company],
salesdt[,logical(1), keyby = company]$company
)
The timings are displayed with
ggplot2::autoplot(timings)
Please note the reverse order in the chart (#1 at the bottom, #8 at the top).
As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.
Variant #3 was a bit of a surprise to me. Despite data.table's fast radix sort, it is less efficient than #1 and #2. It seems to sort first and then extract the unique values.
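Before comparing speed, it may be worth confirming that the variants really return the same vector. The following is a minimal sketch on the small table from the question; the names dt, v1 to v8, results and all_equal are introduced here for illustration only.

```r
library(data.table)

# Small example table from the question (item column omitted for brevity)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
dt <- data.table(company, sales)

v1 <- sort(dt[, unique(company)])                 # unique first, then sort
v2 <- sort(unique(dt$company))                    # same idea in base R
v3 <- dt[order(company), unique(company)]         # order all rows, then unique
v4 <- unique(sort(dt$company))                    # sort all rows, then unique
v5 <- unique(dt[, sort(company)])                 # same idea inside data.table
v8 <- dt[, logical(1), keyby = company]$company   # grouping with keyby

# Compare every result against the first one
results <- list(v1, v2, v3, v4, v5, v8)
all_equal <- all(vapply(results, identical, logical(1), y = v1))
print(all_equal)
```

Variants #6 and #7 are left out because they order by count or by total sales rather than by name. Note also that base sort() follows the session locale while data.table orders in the C locale, so with mixed-case or non-ASCII names the results might not be identical.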
Benchmarking, data.table keyed by company
Motivated by this observation, I repeated the benchmark with the data.table keyed by company.
setkeyv(salesdt, "company")
The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.
Benchmarking, keyed with a bit of tuning
Variant #3 still includes order(company), which isn't necessary if the data.table is already keyed by company. So, I removed the unnecessary calls to order and sort from #3 and #5:
timings <- microbenchmark::microbenchmark(
sort(salesdt[, unique(company)]),
sort(unique(salesdt$company)),
salesdt[, unique(company)],
unique(salesdt$company),
unique(salesdt[, company]),
salesdt[, .N, by = company][order(-N), company],
salesdt[, sum(sales), by = company][order(-V1), company],
salesdt[,logical(1), keyby = company]$company
)
The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.
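The reason the calls to order and sort can be dropped is that setting a key physically reorders the rows, so the unique values come out in key order. A small sketch to check this, assuming a reduced table of 10,000 rows (dt, key_name, u and already_sorted are illustrative names, not part of the benchmark):

```r
library(data.table)

set.seed(1234)
dt <- data.table(company = sample(c("A", "S", "W", "L", "T"), 1e4, TRUE))
setkeyv(dt, "company")              # reorders the rows by company

key_name <- key(dt)                 # name of the key column
u <- unique(dt$company)             # no explicit sort or order call
already_sorted <- !is.unsorted(u)   # TRUE if u is in ascending order
print(u)
```

This only holds while the key is in place; if the key is removed or the rows are reordered afterwards, the explicit sort is needed again.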
Caveat: The benchmarking is based on the original data, which only includes 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results have been obtained with data.table v.1.9.7.
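To explore this caveat, the benchmark could be repeated with many more distinct names. The sketch below is only an assumption about how such a test might be set up: the table size, the number of companies and the generated names ("C00001", "C00002", ...) are hypothetical and not part of the original benchmark, and no timings are claimed here.

```r
library(data.table)

n <- 1e5             # assumed row count, smaller than above to keep it quick
n_companies <- 1e4   # assumed number of distinct company names
set.seed(1234)

# Hypothetical fixed-width names so that all sort methods agree on the order
company_names <- sprintf("C%05d", seq_len(n_companies))
bigdt <- data.table(company = sample(company_names, n, TRUE))

# Check that two of the variants still agree on this data
res_sort_unique <- sort(unique(bigdt$company))
res_keyby <- bigdt[, logical(1), keyby = company]$company
same_result <- identical(res_sort_unique, res_keyby)

# Run the timing only if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    sort(unique(bigdt$company)),
    unique(sort(bigdt$company)),
    bigdt[, logical(1), keyby = company]$company,
    times = 10L
  )
  print(timings)
}
```

Varying n_companies while keeping n fixed would show how the relative ranking of the variants depends on the number of groups.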