Aggregating in R over 80K unique ID's

Problem Description

Another novice question regarding big data. I'm working with a large dataset (3.5m rows) with time series data. I want to create a data.table with a column that finds the first time the unique identifier appears.

df is a data.table, df$timestamp is a date in class POSIXct, and df$id is the unique numeric identifier. I'm using the following code:

# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by = list(id)]  # finding first timestamp for each unique ID
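For reference, data of roughly the shape described above can be simulated as follows; the row count, number of ids, seed, and date range are invented purely for illustration and are not from the original data.

library(data.table)

# Simulated stand-in: ~3.5m rows of (id, timestamp), ~80k distinct ids
set.seed(42)
n_rows <- 3.5e6
n_ids  <- 8e4
df <- data.table(
  id        = sample.int(n_ids, n_rows, replace = TRUE),
  timestamp = as.POSIXct("2013-01-01", tz = "UTC") +
              runif(n_rows, min = 0, max = 365 * 24 * 3600)
)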

Here's the catch. I'm aggregating over 80k unique ID's. R is choking. Anything I can do to optimize my approach?

Recommended Answer

As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey.

df[, min(timestamp), by=id]
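On a data.table shaped like the simulated df above, this is a single grouped scan and needs no prior setkey(). Naming the result column directly avoids the default V1; the name first_timestamp below is just an illustrative choice.

# One row per distinct id, holding that id's earliest timestamp
sub_df <- df[, .(first_timestamp = min(timestamp)), by = id]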

While 80k unique ids sounds like a lot, using the key feature of data.table can make it a manageable prospect.

setkey(df, id)

Then process as before. For what it's worth, you can often take advantage of a pleasant side effect of keys, which is sorting.

set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))

    x y
 1:  3 a
 2:  4 b
 3:  5 a
 4:  7 b
 5:  2 a
 6:  8 b
 7:  9 a
 8:  6 b
 9: 10 a
10:  1 b

setkey(dat, y, x)

    x y
 1:  2 a
 2:  3 a
 3:  5 a
 4:  9 a
 5: 10 a
 6:  1 b
 7:  4 b
 8:  6 b
 9:  7 b
10:  8 b

Then the min or another more complex function is just a subset operation:

dat[, .SD[1], by=y]
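Continuing the small example: because dat is keyed on (y, x), the rows within each y group are already sorted by x, so the first row of each group carries that group's minimum x. Applied back to the original data this would need the key to cover the timestamp as well (the answer's setkey(df, id) alone does not), so the lines below are a hedged sketch rather than a drop-in replacement.

dat[, .SD[1], by = y]   # first (smallest-x) row of each y group
#    y x
# 1: a 2
# 2: b 1

# Sketch for the original problem, assuming the key includes the timestamp:
# setkey(df, id, timestamp)
# df[, .SD[1], by = id]   # earliest row per id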
