Difference between a large number of rows

Problem description


I have a matrix with a very large number of rows and only two paired columns. I want to calculate the difference between consecutive rows in column 1 and, if that difference is less than a predefined value (0.001), average those rows in both columns. For example, I have a matrix called weights:

  A      B
185.0765 10
185.3171 20
186.0777 30
186.0780 40
188.0078 50

library(data.table)

weights <- as.data.table(weights)
# rows whose A lies within 0.001 above the A value of row 3
bins <- weights[A %between% c(A[3], A[3] + 0.001)]
meanA <- mean(bins$A)
meanB <- mean(bins$B)

and the resulting matrix should be:

  A      B
185.0765 10
185.3171 20
186.0779 35
188.0078 50

I would be thankful if someone could advise me on how to do this for a large number of rows. I think a for loop would not be very efficient.

Solution

This should achieve what you want to do, using data.table:

library(data.table)
DT <- data.table(weights)
# a 1 starts a new group; a 0 (gap < 0.001) keeps the row in the current group
DT[, Group := cumsum(c(1, ifelse(diff(A) < 0.001, 0, 1)))]
DT[, lapply(.SD, mean), by = Group, .SDcols = c("A", "B")]
#   Group        A  B
#1:     1 185.0765 10
#2:     2 185.3171 20
#3:     3 186.0779 35
#4:     4 188.0078 50

The idea is to use a cumulative sum to label runs of consecutive A values that differ by less than 0.001. Each gap contributes 0 to the sum if it is under the threshold and 1 otherwise, so a row that is close to its predecessor keeps the same Group number, while a larger gap starts a new group.
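
To make the cumulative-sum step concrete, here is a minimal base R sketch that prints the intermediate vectors for the five example rows. The names gaps and new_group are illustrative only and do not appear in the answer's code.

```r
# Column A from the example in the question
A <- c(185.0765, 185.3171, 186.0777, 186.0780, 188.0078)

# Gap between each row and the previous one (one element shorter than A)
gaps <- diff(A)

# The first row always starts a group; afterwards 1 starts a new group, 0 continues it
new_group <- c(1, ifelse(gaps < 0.001, 0, 1))

# Running total of the group starts gives each row its group number
Group <- cumsum(new_group)

print(new_group)  # 1 1 1 0 1
print(Group)      # 1 2 3 3 4
```

Rows 3 and 4 share group 3 because their gap (about 0.0003) is below the threshold, which is why they are averaged together.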

As suggested by @eddi, a more succinct and efficient way is to do the grouping and the calculation in a single call:

DT <- data.table(weights)
DT[, lapply(.SD, mean), by = list(Group = cumsum(c(1, diff(A)) >= 0.001)), .SDcols = c("A", "B")]
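
Both versions compare only neighbouring rows, so they rely on A being sorted, and a long run of closely spaced values can span more than 0.001 from its first row to its last. For readers without data.table, the same grouping can be sketched in base R. merge_close_rows is a hypothetical helper written for this illustration, and sorting by A first is an assumption about the intended behaviour rather than something stated in the question.

```r
# Hypothetical helper (not part of data.table): average runs of rows whose
# consecutive A values differ by less than `tol`.
merge_close_rows <- function(m, tol = 0.001) {
  # Assumption: grouping is meant on A in ascending order
  m <- m[order(m[, "A"]), , drop = FALSE]
  # TRUE marks the start of a new group; cumsum turns the marks into group ids
  grp <- cumsum(c(TRUE, diff(m[, "A"]) >= tol))
  # Per-group column sums divided by group sizes give per-group means
  out <- rowsum(m, grp) / as.vector(table(grp))
  rownames(out) <- NULL
  out
}

weights <- cbind(A = c(185.0765, 185.3171, 186.0777, 186.0780, 188.0078),
                 B = c(10, 20, 30, 40, 50))
result <- merge_close_rows(weights)
print(result)
```

On the example data this returns four rows, with the third being the average of the original rows 3 and 4 (B = 35).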

As an aside, it is always helpful to give an absolute number of rows. "A very large number of rows" means different things to different people and use cases. Are we talking millions? Hundreds of millions?
