Count features for different ids in columns in R in a faster way


Question

I am trying to process a 20 GB data file in R. I have 16 GB of RAM and an i7 processor. I am reading the data using:

y<-read.table(file="sample.csv", header = TRUE, sep = ",", skip =0, nrows =50000000)

The dataset "y" looks like this:

id    feature

21    234
21    290
21    234
21    7802
21    3467
21    234
22    235
22    235
22    1234
22    236
22    134
23    9133
23    223
23    245
23    223  
23    122
23    223 

The above is a sample dataset showing the different features recorded for each id. I want to count how many times a particular feature, listed in another dataset x, occurs for the corresponding id in y.

The dataset x looks like this:

id    feature

   21      234
   22      235
   23      223

The final output that I want is as follows:

 id    feature_count

   21      3
   22      2
   23      3

As we see, 234 occurred three times for id 21, 235 occurred twice for id 22, and 223 occurred three times for id 23.

For this, I have tried getting the positions where each new id starts (e.g. the 1st, 7th and 12th positions in the sample above) and then counting the feature using a for loop, as follows:

positions=0
positions[1]=1
j=2
for(i in 1:50000000){
    if(y$id[i]!=y$id[i+1]){
    positions[j]=i+1
    j=j+1
  }
}

Since the data is huge, the loop takes a lot of time (for 50 million rows it takes 321 seconds on the PC described above, and I have 300 million rows).

for(i in 1 :length(positions)){
  for(j in positions[i]:positions[i+1]){
    if(y$feature[j]==x$feature[i]){         
       feature_count[i]=feature_count[i]+1
    }
  }
}

Are there any R functions that can do this whole job faster? Also, iterating the for loop over "positions[i]:positions[i+1]" throws an error about NA arguments in the for loop. Please suggest the right way to do that too.

Recommended answer

I admit that I don't really understand the question the way it is written, but it sounds like "data.table" would be the way to go, and you should look into its special symbol .N, which gives the number of rows in each group. As already mentioned, fread is going to be much faster than read.csv, so I'll assume that you've read the data into a data.table named "DT".
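A minimal sketch of the fread step mentioned above, for illustration only. It writes a tiny stand-in CSV to a temporary file so that it runs on its own; the path `csv_path` and the three sample rows are assumptions, and in practice the path would point at the real file (such as "sample.csv" from the question). Whether the full 20 GB file fits in 16 GB of RAM is a separate question that this sketch does not address.

```r
library(data.table)

# Stand-in for the real file: a tiny CSV written to a temporary path
# (hypothetical data, same column layout as in the question)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,feature", "21,234", "21,290", "22,235"), csv_path)

# fread detects the separator and the header row itself;
# select keeps only the columns that are actually needed
DT <- fread(csv_path, select = c("id", "feature"))
```

The result is already a data.table, so no conversion step is needed before grouping or joining.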

Here is a small example:

library(data.table)
DT <- data.table(id = c(rep(21, 6), rep(22, 5), 23, 23),
                 feature = c(234, 290, 234, 7802, 3467, 234, 235,
                             235, 1234, 236, 134, 9133, 223))
DT
#     id feature
#  1: 21     234
#  2: 21     290
#  3: 21     234
#  4: 21    7802
#  5: 21    3467
#  6: 21     234
#  7: 22     235
#  8: 22     235
#  9: 22    1234
# 10: 22     236
# 11: 22     134
# 12: 23    9133
# 13: 23     223

If you just wanted to count the occurrences of each unique feature within each id, you could do:

DT[, .N, by = "id,feature"]
#     id feature N
#  1: 21     234 3
#  2: 21     290 1
#  3: 21    7802 1
#  4: 21    3467 1
#  5: 22     235 2
#  6: 22    1234 1
#  7: 22     236 1
#  8: 22     134 1
#  9: 23    9133 1
# 10: 23     223 1

If you wanted the count of the first "feature" for each "id", you could use:

DT[, .N, by = "id,feature"][, .SD[1], by = "id"]
#    id feature N
# 1: 21     234 3
# 2: 22     235 2
# 3: 23    9133 1

If you wanted to get the most frequently occurring "feature" for each "id" (which gives the same result as above, in this case), you can try the following:

DT[, .N, by = "id,feature"][, lapply(.SD, function(x) x[which.max(N)]), by = "id"]


Update

Based on your new description, this seems much easier.

Just merge your datasets and aggregate the counts. Again, this is fast to do in "data.table":

DTY <- data.table(y, key = "id,feature")
DTX <- data.table(x, key = "id,feature")
DTY[DTX][, .N, by = id]
#    id N
# 1: 21 3
# 2: 22 2
# 3: 23 3

Or:

DTY[, .N, by = key(DTY)][DTX]
#    id feature N
# 1: 21     234 3
# 2: 22     235 2
# 3: 23     223 3
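As a variation on the keyed joins above, recent data.table versions also allow the join columns to be named directly with the `on` argument, so no key has to be set first. The sketch below is an illustration under that assumption, not part of the original answer; it rebuilds the sample "x" and "y" inline so that it runs on its own, and the name `res` is introduced here only for the example.

```r
library(data.table)

# Sample data from the question, rebuilt inline
DTY <- data.table(
  id = c(rep(21L, 6), rep(22L, 5), rep(23L, 6)),
  feature = c(234L, 290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L,
              236L, 134L, 9133L, 223L, 245L, 223L, 122L, 223L)
)
DTX <- data.table(id = 21:23, feature = c(234L, 235L, 223L))

# For each row of DTX, count the rows of DTY with the same id and feature.
# by = .EACHI evaluates .N once per row of DTX.
res <- DTY[DTX, on = c("id", "feature"), .N, by = .EACHI]
res
#    id feature N
# 1: 21     234 3
# 2: 22     235 2
# 3: 23     223 3
```

Counting during the join avoids materialising every matching row of the large table before aggregating.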

This assumes that "x" and "y" are defined as follows to begin with:

x <- structure(list(id = 21:23, feature = c(234L, 235L, 223L),
  counts = c(3L, 2L, 3L)), .Names = c("id", "feature", "counts"),
  row.names = c(NA, -3L), class = "data.frame")
y <- structure(list(id = c(21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L, 
  22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L), feature = c(234L,
  290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L, 236L, 134L,
  9133L, 223L, 245L, 223L, 122L, 223L)), .Names = c("id", "feature"),
  class = "data.frame", row.names = c(NA, -17L))
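On the NA error mentioned in the question: in the loop over "positions[i]:positions[i+1]", the last iteration reads positions[i+1] one element past the end of the vector, which is NA, so the colon operator fails. That range also includes the first row of the next id, and feature_count is never initialised. The following base R sketch shows one possible way around these issues; the name `bounds` is introduced here for illustration, and the code assumes that y is sorted by id and that the rows of x appear in the same order as the ids in y.

```r
# Sample data from the question
y <- data.frame(
  id = c(rep(21L, 6), rep(22L, 5), rep(23L, 6)),
  feature = c(234L, 290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L,
              236L, 134L, 9133L, 223L, 245L, 223L, 122L, 223L)
)
x <- data.frame(id = 21:23, feature = c(234L, 235L, 223L))

# Start of each run of identical ids, found without an explicit loop
positions <- c(1L, which(diff(y$id) != 0) + 1L)

# Sentinel: one past the last row, so bounds[i + 1] always exists
bounds <- c(positions, nrow(y) + 1L)

# Initialise the counts before the loop
feature_count <- integer(length(positions))
for (i in seq_along(positions)) {
  # Rows belonging to the i-th id only (stop before the next id starts)
  rows <- bounds[i]:(bounds[i + 1] - 1L)
  feature_count[i] <- sum(y$feature[rows] == x$feature[i])
}
feature_count
# [1] 3 2 3
```

This still loops once per id, so for hundreds of millions of rows the "data.table" join above is likely to remain the faster route.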
