Difference between a large number of rows
Problem description
I have a matrix with a very large number of rows and only two paired columns. I want to calculate the differences between the rows in column 1 and, if the difference is less than a predefined value (0.001), calculate the average of those rows in both columns. For example, I have a matrix called weights,
A B
185.0765 10
185.3171 20
186.0777 30
186.0780 40
188.0078 50
weights<-as.data.table(weights)
bins<-weights[A %between% c(A[3],(A[3]+.001))]
meanA<-mean(bins$A)
meanB<-mean(bins$B)
and the resulting matrix will be,
A B
185.0765 10
185.3171 20
186.0779 35
188.0078 50
I would be thankful if someone could please advise me how to do this for a large number of rows. I think using a for loop would not be very efficient.
Solution
This should achieve what you want to do, using data.table:
library(data.table)
DT <- data.table( weights )
# Group id: add 1 whenever the gap to the previous row is >= 0.001, otherwise add 0
DT[ , Group := ( cumsum( c( 1 , ifelse( diff(weights$A) < 0.001 , 0 , 1 ) ) ) ) ]
# Average A and B within each group
DT[ , lapply(.SD, mean) , by=Group , .SDcols = c("A","B") ]
#    Group        A  B
# 1:     1 185.0765 10
# 2:     2 185.3171 20
# 3:     3 186.0779 35
# 4:     4 188.0078 50
The idea is we use a cumulative sum to find the groups of A that have a difference of < 0.001. If the difference is under this threshold we put a 0 in our Group column, so in the cumulative sum it will be part of the same group.
As suggested by @eddi, a more succinct and efficient way of doing this would be to do the grouping and the calculation all at the same time, in one call:
DT <- data.table( weights )
DT[ , lapply(.SD, mean) , by = list(Group = cumsum(c(1,diff(A)) >= 0.001)) , .SDcols = c("A","B") ]
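To make the cumulative-sum trick easier to follow, here is a minimal sketch in base R that builds the grouping vector step by step on the sample data from the question. The variable names (threshold, gaps, starts_new_group, group, result) are made up for illustration, and aggregate() is swapped in for the data.table call only to keep the example free of extra packages; it is not the method from the answer above.

```r
# Sample data from the question
weights <- data.frame(
  A = c(185.0765, 185.3171, 186.0777, 186.0780, 188.0078),
  B = c(10, 20, 30, 40, 50)
)

threshold <- 0.001  # the predefined value from the question

# Gap between each row and the previous one (one element shorter than A)
gaps <- diff(weights$A)

# TRUE where a row starts a new group; the first row always starts one
starts_new_group <- c(TRUE, gaps >= threshold)

# Counting the group starts cumulatively gives one group id per row
group <- cumsum(starts_new_group)
print(group)   # 1 2 3 3 4 for this data

# Average both columns within each group (base R stand-in for the data.table call)
result <- aggregate(weights[c("A", "B")], by = list(Group = group), FUN = mean)
print(result)
```

Two things are worth keeping in mind with this approach. It compares each row only with the row directly before it, so it assumes A is already sorted. It also chains: a run of rows that are each within 0.001 of their predecessor ends up in one group even if the first and last rows of the run differ by more than 0.001.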
As an aside, it is always helpful to have an absolute number of rows. A very large number of rows means different things to different people and use-cases. Are we talking millions? Hundreds of millions?