What is the fastest way to get a vector of sorted unique values from a data.table?


Question

The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:

# 1
sort(salesdt[, unique(company)])
# 2
sort(unique(salesdt$company))
# 3
salesdt[order(company), unique(company)]

Another answer suggested other sort options than lexicographical order:

salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]

The data.table was created by

library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company,item,sales) 

As always when several options are available, I started to wonder which solution would be the best, in particular if the data.table were much larger. I have searched SO a bit but haven't found a specific answer so far.

Solution

For benchmarking, a larger data.table is created with 1,000,000 rows:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

For the sake of completeness also the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

will be benchmarked, although it seems obvious that sorting the unique values should be faster than extracting the unique values from the sorted vector.

In addition, two other sort options from this answer are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]
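
To illustrate what these two orderings return, here is a minimal sketch on the small sample data from the question. The names by_count and by_sales are only used for this illustration.

```r
library(data.table)

# Sample data from the question (the item column is not needed here)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company, sales)

# 6: companies ordered by decreasing number of rows
by_count <- salesdt[, .N, by = company][order(-N), company]

# 7: companies ordered by decreasing total sales
# (the unnamed sum is stored in the default column V1)
by_sales <- salesdt[, sum(sales), by = company][order(-V1), company]

print(by_count)
print(by_sales)
```

In this sample, W and T tie for first place in both orderings (three rows and total sales of 500 each), so only the positions after the tie are unambiguous.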

Edit: Following from Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company
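
Before timing anything, it may be worth checking that the lexicographic variants (#1 to #5 and #8) really return the same vector. The following is a minimal sketch on the sample data from the question; the list name results is only used for this illustration. One assumption to keep in mind: base sort() follows the locale's collation, while data.table orders in C-locale, so the results could differ for mixed-case or non-ASCII names. For the upper-case letters used here both should agree.

```r
library(data.table)

# Sample data from the question (only the company column is needed here)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
salesdt <- data.table(company)

# Variants #1 to #5 and #8; each should return the sorted unique values
results <- list(
  v1 = sort(salesdt[, unique(company)]),
  v2 = sort(unique(salesdt$company)),
  v3 = salesdt[order(company), unique(company)],
  v4 = unique(sort(salesdt$company)),
  v5 = unique(salesdt[, sort(company)]),
  v8 = salesdt[, logical(1), keyby = company]$company
)

# TRUE if every variant is identical to the first one
all_equal <- all(vapply(results, identical, logical(1), results$v1))
print(all_equal)
```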

Benchmarking, no key set

Benchmarking is done with the help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[,sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)

Please note the reversed order in the chart (#1 at the bottom, #8 at the top).
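
If a chart is not convenient, the same timings can also be inspected as a table. The sketch below is an illustration only: it uses a small toy table and just two of the variants so that it runs quickly, and the names dt, timings_small and timing_summary are not part of the original benchmark.

```r
library(data.table)

# Toy data; the column name follows the question, the size is arbitrary
set.seed(1234)
dt <- data.table(company = sample(c("A", "S", "W", "L", "T"), 1e4, TRUE))

# A short run with only two of the variants to keep it quick
timings_small <- microbenchmark::microbenchmark(
  sort_unique = sort(unique(dt$company)),
  keyby = dt[, logical(1), keyby = company]$company,
  times = 20L
)

# One row per expression, with min, lq, mean, median, uq, max and neval
timing_summary <- summary(timings_small, unit = "ms")
print(timing_summary[, c("expr", "median")])
```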

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.

Variant #3 came as a bit of a surprise to me. Despite data.table's fast radix sort, it is less efficient than #1 and #2. It seems to sort first and then extract the unique values.

Benchmarking, data.table keyed by company

Motivated by this observation I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")

The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.
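
For readers less familiar with keys: setkeyv() reorders the table by reference and records the key column, so the keyed column is already sorted afterwards. A minimal sketch on toy data (the table dt and its values are hypothetical):

```r
library(data.table)

# Hypothetical toy table, deliberately unsorted
dt <- data.table(company = c("W", "A", "T", "A"), sales = c(1, 2, 3, 4))

# Sorts dt by company in place and marks company as the key
setkeyv(dt, "company")

print(key(dt))     # name of the key column
print(haskey(dt))  # whether a key is set
print(dt$company)  # the column is now in sorted order
```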

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company), which isn't necessary if the data.table is already keyed by company. So, I removed the unnecessary calls to order and sort from variants #3 to #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.

Caveat: The benchmarking is based on the original data, which only includes 5 different letters as company names. The results are likely to look different with a larger number of distinct company names. The results have been obtained with data.table v.1.9.7.
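
To explore this caveat, the benchmark could be repeated with many more distinct names. The sketch below is only a starting point under stated assumptions: the number of distinct names (n_distinct), the generated names "C1", "C2", ... and the reduced row count are arbitrary choices and not part of the original benchmark. No timings are claimed here.

```r
library(data.table)

n <- 1e5           # fewer rows than the 1e6 above, to keep the run short
n_distinct <- 1e4  # assumed number of distinct company names (arbitrary)
set.seed(1234)     # to reproduce the data

# Hypothetical company names "C1", "C2", ..., "C10000"
pool <- paste0("C", seq_len(n_distinct))
bigdt <- data.table(company = sample(pool, n, TRUE))

# Two of the variants from above, with a small number of repetitions
timings_many <- microbenchmark::microbenchmark(
  sort(unique(bigdt$company)),
  bigdt[, logical(1), keyby = company]$company,
  times = 10L
)
print(timings_many)
```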
