Why is the time complexity of this loop non-linear?

Question

Why is the time complexity of this loop non-linear, and why is it so slow? The loop takes ~38s for N=50k and ~570s for N=200k. Is there a faster way to do this? Rprof() seems to indicate that writing to memory is very slow.

df <- data.frame(replicate(5, runif(200000)))
df[,1:3] <- round(df[,1:3])

Rprof(line.profiling = TRUE); timer <- proc.time()
x <- df; N <- nrow(df); i <- 1 
ind <- df[1:(N-1),1:3] == df[2:N,1:3]; 
rind <- which(apply(ind,1,all))
N <- length(rind)
while(i <= N)
{
    x$X4[rind[i]+1] <- x$X4[rind[i]+1] + x$X4[rind[i]]
    x$X5[rind[i]+1] <- x$X4[rind[i]+1] * x$X3[rind[i]+1]
    x$X5[rind[i]+1] <- trunc(x$X5[rind[i]+1]*10^8)/10^8
    x$X1[rind[i]] <- NA
    i <- i + 1
}
x <- na.omit(x)
proc.time() - timer; Rprof(NULL)
summaryRprof(lines = "show")

The purpose of this algorithm is to iterate over the data frame and combine adjacent rows that match on certain elements. That is, it removes one of the rows and adds some of that row's values to the other row. The resulting data frame should have n fewer rows, where n is the number of matching adjacent rows in the original data frame. Every time a pair of rows is combined, the indices of the source data frame and the new data frame get out of sync by 1, since one row is removed/omitted from the new frame, so i keeps track of the position on the source data frame, and q keeps track of the position on the new data frame (q does not appear in the updated code above).

The code above is updated thanks to @joran's comment. The performance improved substantially, to ~5.5s for N=50k and ~88s for N=200k. However, the time complexity is still non-linear, which I can't fathom. I need to run this at N = 1 million or more, so it is still not fast enough.

Recommended answer

Only the X4 column update depends on previous values, so the loop can be mostly 'vectorized' (with a little bit of optimization, avoiding the addition of 1 to rind in each iteration) as

rind1 <- rind + 1L
for (i in seq_len(N))
    x$X4[rind1[i]] <- x$X4[rind1[i]] + x$X4[rind[i]]

x$X5[rind1] <- x$X4[rind1] * x$X3[rind1]
x$X5[rind1] <- trunc(x$X5[rind1] * 10^8) / 10^8
x$X1[rind] <- NA
na.omit(x)
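
To see why the X4 update stays in a loop while the other columns do not, here is a minimal sketch on a made-up six-element vector (the values, and the names loop_X4 and naive_X4, are illustrative assumptions and not part of the original answer). When two matching pairs are adjacent, the second update must read the result of the first, which a single vectorized assignment does not do.

```r
# Toy data: rows 3-4 and 4-5 both match, so row 4 is first a target, then a source.
X4    <- c(1, 2, 3, 4, 5, 6)
rind  <- c(3L, 4L)
rind1 <- rind + 1L

# Sequential update, as in the answer: each step sees the previous step's result.
loop_X4 <- X4
for (i in seq_along(rind))
    loop_X4[rind1[i]] <- loop_X4[rind1[i]] + loop_X4[rind[i]]

# Naive vectorized update: the right-hand side is evaluated once, from the old values.
naive_X4 <- X4
naive_X4[rind1] <- naive_X4[rind1] + naive_X4[rind]

loop_X4    # 1 2 3 7 12 6
naive_X4   # 1 2 3 7  9 6
```

The X5 and X1 updates have no such chain, since each one reads only values that are already final, which is why they can be moved out of the loop.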

X4 is a numeric vector, and the update can be made more efficient by updating it as a standalone vector rather than as a column of a data.frame:

X4 <- x$X4
for (i in seq_len(N))
    X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]
x$X4 <- X4

For comparison, we have

f0 <- function(nrow) {
    set.seed(123)
    df <- data.frame(replicate(5, runif(nrow)))
    df[,1:3] <- round(df[,1:3])
    x <- df; N <- nrow(df); i <- 1 
    ind <- df[1:(N-1),1:3] == df[2:N,1:3]; 
    rind <- which(apply(ind,1,all))
    N <- length(rind)

    while(i <= N)
    {
        x$X4[rind[i]+1] <- x$X4[rind[i]+1] + x$X4[rind[i]]
        x$X5[rind[i]+1] <- x$X4[rind[i]+1] * x$X3[rind[i]+1]
        x$X5[rind[i]+1] <- trunc(x$X5[rind[i]+1]*10^8)/10^8
        x$X1[rind[i]] <- NA
        i <- i + 1
    }
    na.omit(x)
}

f1a <- function(nrow) {
    set.seed(123)
    df <- data.frame(replicate(5, runif(nrow)))
    df[,1:3] <- round(df[,1:3])
    x <- df; N <- nrow(df)
    ind <- df[1:(N-1),1:3] == df[2:N,1:3]; 
    rind <- which(apply(ind,1,all))  

    rind1 <- rind + 1L
    for (i in seq_along(rind))
        x$X4[rind1[i]] <- x$X4[rind1[i]] + x$X4[rind[i]]

    x$X5[rind1] <- x$X4[rind1] * x$X3[rind1]
    x$X5[rind1] <- trunc(x$X5[rind1] * 10^8) / 10^8
    x$X1[rind] <- NA
    na.omit(x)
}

f4a <- function(nrow) {
    set.seed(123)
    df <- data.frame(replicate(5, runif(nrow)))
    df[,1:3] <- round(df[,1:3])
    x <- df; N <- nrow(df) 
    ind <- df[1:(N-1),1:3] == df[2:N,1:3]; 
    rind <- which(apply(ind,1,all))

    rind1 <- rind + 1L
    X4 <- x$X4
    for (i in seq_along(rind))
        X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]
    x$X4 <- X4

    x$X1[rind] <- NA
    x$X5[rind1] <- X4[rind1] * x$X3[rind1]
    x$X5[rind1] <- trunc(x$X5[rind1] * 10^8) / 10^8

    na.omit(x)
}

The results are identical

> identical(f0(1000), f1a(1000))
[1] TRUE
> identical(f0(1000), f4a(1000))
[1] TRUE

The speed-up is substantial (using library(microbenchmark))

> microbenchmark(f0(10000), f1a(10000), f4a(10000), times=10)
Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
  f0(10000) 346.35906 354.37637 361.15188 363.71627 366.74944 373.88275    10
 f1a(10000) 124.71766 126.43532 127.99166 127.39257 129.51927 133.01573    10
 f4a(10000)  41.70401  42.48141  42.90487  43.00584  43.32059  43.83757    10

The reason for the difference can be seen when R has been compiled with memory profiling enabled:

> tracemem(x)
[1] "<0x39d93a8>"
> tracemem(x$X4)
[1] "<0x6586e40>"
> x$X4[1] <- 1
tracemem[0x39d93a8 -> 0x39d9410]: 
tracemem[0x6586e40 -> 0x670d870]: 
tracemem[0x39d9410 -> 0x39d9478]: 
tracemem[0x39d9478 -> 0x39d94e0]: $<-.data.frame $<- 
tracemem[0x39d94e0 -> 0x39d9548]: $<-.data.frame $<- 
>

Each line indicates a memory copy, so updating a cell in a data frame incurs 5 copies, either of the outer structure or of the vector itself. In contrast, a vector can be updated without any copies.

> tracemem(X4)
[1] "<0xdd44460>"
> X4[1] = 1
tracemem[0xdd44460 -> 0x9d26c10]: 
> X4[1] = 2
>

(The first assignment is expensive because it represents the duplication of the data.frame column. Subsequent updates are to X4; since only X4 refers to the vector being updated, the vector does not need to be duplicated.)

The data.frame implementation does seem to scale non-linearly

> microbenchmark(f1a(100), f1a(1000), f1a(10000), f1a(100000), times=10)
Unit: milliseconds
       expr         min          lq        mean      median          uq         max neval
   f1a(100)    2.372266    2.479458    2.551568    2.524818    2.640244    2.727221    10
  f1a(1000)   10.831288   11.100009   11.210483   11.194863   11.432533   11.581644    10
 f1a(10000)  130.011104  138.686445  139.556787  141.138329  141.522686  147.993499    10
 f1a(1e+05) 4092.439956 4117.818817 4145.809235 4143.634663 4172.282888 4216.129732    10

The reason is apparent in the second line of the tracemem output above: updating one row triggers a copy of the entire column. So the algorithm scales as the number of rows to update times the number of rows in a column, which is approximately quadratic.
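
As a rough back-of-the-envelope sketch of that quadratic term, the volume of copied data can be estimated under two stated assumptions: each loop iteration copies one full column of 8-byte doubles, and about 1/8 of adjacent pairs match (each of the three rounded columns agrees with its neighbour with probability 1/2). The helper copied_bytes is hypothetical and not part of the original answer.

```r
# Estimated bytes copied by the data.frame loop: one full column copy per update.
copied_bytes <- function(n_rows, p_match = 1/8, bytes_per_value = 8) {
    n_updates <- p_match * (n_rows - 1)    # expected number of matching adjacent pairs
    n_updates * n_rows * bytes_per_value   # each update copies n_rows values
}

gb_200k <- copied_bytes(2e5) / 1e9               # on the order of 40 GB
ratio   <- copied_bytes(2e5) / copied_bytes(5e4) # growth when N goes from 50k to 200k
```

Under these assumptions the ratio comes out close to 16 for a fourfold increase in N, which is in line with the ~5.5s and ~88s timings reported in the question for the updated code.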

f4a() seems to scale linearly

> microbenchmark(f4a(100), f4a(1000), f4a(10000), f4a(100000), f4a(1e6), times=10)
Unit: milliseconds
       expr         min          lq        mean      median          uq         max neval
   f4a(100)    1.741458    1.756095    1.827886    1.773887    1.929943    2.003700    10
  f4a(1000)    5.286016    5.517491    5.558091    5.569514    5.671840    5.764022    10
 f4a(10000)   42.906895   43.025385   43.880020   43.928631   44.633684   44.983002    10
 f4a(1e+05)  467.698285  478.919843  539.696364  552.896109  576.707913  644.927832    10
 f4a(1e+06) 5385.029968 5521.645185 5614.960871 5573.475270 5794.307470 5823.868167    10

One could try to be clever about vectorizing the loop, but is it now necessary?

A tuned version of the data processing part of the function uses negative indexing (e.g., -nrow(df)) to remove rows from the data frame, rowSums() instead of apply(), and unname() so that subset operations don't carry around unused names:

g0 <- function(df) {
    ind <- df[-nrow(df), 1:3] == df[-1, 1:3]
    rind <- unname(which(rowSums(ind) == ncol(ind)))
    rind1 <- rind + 1L

    X4 <- df$X4
    for (i in seq_along(rind))
        X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]

    df$X4 <- X4
    df$X1[rind] <- NA
    df$X5[rind1] <- trunc(df$X4[rind1] * df$X3[rind1] * 10^8) / 10^8

    na.omit(df)
}
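
As a small check of the rowSums() substitution, the sketch below compares the two ways of finding matching adjacent rows on a made-up six-row data frame (the data and the names rind_apply and rind_rowsums are illustrative assumptions). A row of the logical matrix is all TRUE exactly when its row sum equals the number of columns.

```r
# Toy key columns: rows 1-2, 3-4 and 4-5 match on all three columns.
df <- data.frame(X1 = c(0, 0, 1, 1, 1, 0),
                 X2 = c(1, 1, 0, 0, 0, 1),
                 X3 = c(1, 1, 1, 1, 1, 0))

# Compare each row with the next one; the result is a logical matrix.
ind <- df[-nrow(df), 1:3] == df[-1, 1:3]

rind_apply   <- which(apply(ind, 1, all))                 # may carry row names along
rind_rowsums <- unname(which(rowSums(ind) == ncol(ind)))  # plain integer positions

rind_rowsums   # 1 3 4
```

Both give the same positions; rowSums() avoids calling an R function once per row, which is where apply() spends its time on large inputs.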

Compared to the data.table solution suggested by @Khashaa

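g1 <- function(df) {
    # setDT() converts df to a data.table in place (by reference) rather than copying it;
    # rleid() assigns one id per run of consecutive rows with equal (X1, X2, X3)
    x <- setDT(df)[, r := rleid(X1, X2, X3)]
    # collapse each run to its last row, summing X4 over the run
    x <- x[, .(X1=X1[.N], X2=X2[.N], X3=X3[.N], X4=sum(X4), X5=X5[.N]), by=r]
    # recompute X5 for every group
    x <- x[, X5 := trunc(X3 * X4 * 10^8)/10^8]
    x
}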

The base R version compares favorably on time

> n_row <- 200000
> set.seed(123)
> df <- data.frame(replicate(5, runif(n_row)))
> df[,1:3] <- round(df[,1:3])
> system.time(g0res <- g0(df))
   user  system elapsed 
  0.247   0.000   0.247 
> system.time(g1res <- g1(df))
   user  system elapsed 
  0.551   0.000   0.551 

(The pre-tuning version in f4a takes about 760ms, so it is more than twice as slow.)

The results from the data.table implementation are not correct

> head(g0res)
  X1 X2 X3        X4        X5
1  0  1  1 0.4708851 0.8631978
2  1  1  0 0.8977670 0.8311355
3  0  1  0 0.7615472 0.6002179
4  1  1  1 0.6478515 0.5616587
5  1  0  0 0.5329256 0.5805195
6  0  1  1 0.8526255 0.4913130
> head(g1res)
   r X1 X2 X3        X4        X5
1: 1  0  1  1 0.4708851 0.4708851
2: 2  1  1  0 0.8977670 0.0000000
3: 3  0  1  0 0.7615472 0.0000000
4: 4  1  1  1 0.6478515 0.6478515
5: 5  1  0  0 0.5329256 0.0000000
6: 6  0  1  1 0.8526255 0.8526255

and I'm not enough of a data.table wizard (barely a data.table user) to know what the correct formulation is.
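
One possible reading of the discrepancy, offered as an observation rather than a verified fix: in the head() output above, every X5 shown for g1res equals X3 * X4, whereas g0 only rewrites X5 on rows in rind1 and leaves the other rows' X5 untouched. The base R sketch below follows the same run-based idea as the data.table version but updates X5 only on rows that received a contribution. The names g2, same, run_id, merged and last are hypothetical, and the function has not been benchmarked.

```r
# Run-based formulation in base R (a sketch, assuming columns X1..X5 as in the question).
g2 <- function(df) {
    n   <- nrow(df)
    key <- df[, 1:3]
    # same[i] is TRUE when row i matches row i + 1 on X1, X2 and X3
    same   <- rowSums(key[-n, ] == key[-1, ]) == ncol(key)
    run_id <- cumsum(c(TRUE, !same))  # one id per run of consecutive matching rows
    merged <- c(FALSE, same)          # rows that receive a contribution (rind + 1)
    last   <- c(!same, TRUE)          # the last row of each run is the one kept

    # running sum of X4 within each run, equivalent to the sequential loop
    df$X4 <- ave(df$X4, run_id, FUN = cumsum)
    # X5 is rewritten only where a merge happened; other rows keep their X5
    df$X5[merged] <- trunc(df$X4[merged] * df$X3[merged] * 10^8) / 10^8
    df[last, ]
}
```

Because cumsum() may round differently from the explicit loop in the last bits, a comparison against g0 is better done with all.equal() than with identical(); the result also lacks the "na.action" attribute that na.omit() adds.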

Compiling (which presumably benefits only the for loop?) increases speed by about 20%

> g0c <- compiler::cmpfun(g0)
> microbenchmark(g0(df), g0c(df), times=10)
Unit: milliseconds
     expr      min      lq     mean   median       uq      max neval
  g0(df)  250.0750 262.941 276.1549 276.8848 281.1966 321.3778    10
  g0c(df) 214.3132 219.940 228.0784 230.2098 235.4579 242.6636    10
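
For readers who want to reproduce the compilation step in isolation, here is a minimal sketch on a hypothetical helper, running_sum, that contains only a for loop similar to the one in g0. Note that since R 3.4.0 the JIT byte-code compiler is enabled by default, so on a current installation an explicit cmpfun() call may show little or no additional gain; the figure above presumably comes from an older R.

```r
# Hypothetical loop-only function: each element accumulates its predecessor.
running_sum <- function(x) {
    for (i in seq_along(x)[-1])
        x[i] <- x[i] + x[i - 1]
    x
}

# Explicit byte-compilation, as done for g0 in the answer.
running_sum_c <- compiler::cmpfun(running_sum)

# A negative argument reports the current JIT level without changing it.
jit_level <- compiler::enableJIT(-1)

identical(running_sum(c(1, 2, 3)), running_sum_c(c(1, 2, 3)))   # TRUE
```

Compilation changes only how the function is evaluated, not what it returns, so the compiled and uncompiled versions should always agree.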
