Aggregating in R over 80K unique IDs
Question
Another novice question regarding big data. I'm working with a large dataset (3.5m rows) of time series data. I want to create a data.table with a column that records the first time each unique identifier appears.
df is a data.table, df$timestamp is a date of class POSIXct, and df$id is the unique numeric identifier. I'm using the following code:
# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by = list(id)]  # first timestamp for each unique ID
Here's the catch: I'm aggregating over 80k unique IDs, and R is choking. Is there anything I can do to optimize my approach?
Answer
As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey:
df[, min(timestamp), by=id]
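As a self-contained sketch of that one-liner (the column names id and timestamp match the question, but the data below are synthetic stand-ins for the asker's 3.5m-row table), you can also name the aggregate column with data.table's .() alias for list():

```r
library(data.table)

set.seed(42)
n  <- 1e5
df <- data.table(
  id        = sample(1:8e4, n, replace = TRUE),  # roughly 80k unique IDs
  timestamp = as.POSIXct("2013-01-01", tz = "UTC") + runif(n, 0, 3e7)
)

# First appearance of each unique id, with a readable column name
sub_df <- df[, .(first_seen = min(timestamp)), by = id]
```

Using .(first_seen = ...) instead of the bare min(timestamp) gives the result column a name rather than the default V1.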
While 80k unique IDs sounds like a lot, using the key feature of data.table can make it a manageable prospect.
setkey(df, id)
Then process as before. For what it's worth, you can often take advantage of a pleasant side effect of keys: the table is sorted.
library(data.table)
set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))
dat
x y
1: 3 a
2: 4 b
3: 5 a
4: 7 b
5: 2 a
6: 8 b
7: 9 a
8: 6 b
9: 10 a
10: 1 b
setkey(dat, y, x)
dat
x y
1: 2 a
2: 3 a
3: 5 a
4: 9 a
5: 10 a
6: 1 b
7: 4 b
8: 6 b
9: 7 b
10: 8 b
Then min, or another more complex function, is just a subset operation:
dat[, .SD[1], by=y]
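Because the key sorts by y and then x, the first row of each group already holds the group's smallest x, so .SD[1] and an explicit min(x) agree. A minimal sketch of that equivalence, rebuilding the small dat table from above:

```r
library(data.table)

set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))
setkey(dat, y, x)                       # sorted by group, then by value

first_rows <- dat[, .SD[1], by = y]     # first row per group
group_mins <- dat[, .(x = min(x)), by = y]  # explicit per-group minimum
```

Subsetting a sorted table avoids re-scanning each group for the minimum, which is why keyed access helps at the 80k-group scale.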