How can I eliminate a loop over a data.table?


Question

I have two data.tables, as shown below:

library(data.table)
N = 10
A.DT <- data.table(a1 = c(rnorm(N,0,1)), a2 = NA)
B.DT <- data.table(b1 = c(rnorm(N,0,1)), b2 = 1:N)
setkey(A.DT,a1)    
setkey(B.DT,b1)


I tried to change my previous data.frame implementation to a data.table implementation by changing the for-loop as shown below:

for (i in 1:nrow(B.DT)) {
  for (j in nrow(A.DT):1) {
    if (B.DT[i,b2] <= N/2 
        && B.DT[i,b1] < A.DT[j,a1]) {
      A.DT[j,]$a2 <- B.DT[i,]$b1
      break
    }
  }
} 

I get the following error message:

Error in `[<-.data.table`(`*tmp*`, j, a2, value = -0.391987468746123) : 
  object "a2" not found




I think the way I access data.table is not quite right. I am new to it. I guess there is a quicker way of doing it than cycling up and down the two datatables.


I'd like to know if the loop shown above could be simplified/vectorised.


Edit: the data.table data for copy/paste:

# A.DT
    a1  a2
1   -1.4917779  NA
2   -1.0731161  NA
3   -0.7533091  NA
4   -0.3673273  NA
5   -0.159569   NA
6   -0.1551948  NA
7   -0.0430574  NA
8   0.1783496   NA
9   0.4276034   NA
10  1.0697412   NA

# B.DT
    b1  b2
1   0.64229018  1
2   1.00527902  2
3   0.24746294  3
4   -0.50288835 4
5   0.34447791  5
6   -0.22205129 6
7   0.60099079  7
8   -0.70242284 8
9   0.6298599   9
10  0.08917988  10

# OUTPUT
    a1  a2
1   -1.4917779  NA
2   -1.0731161  NA
3   -0.7533091  NA
4   -0.3673273  NA
5   -0.159569   NA
6   -0.1551948  NA
7   -0.0430574  NA
8   0.1783496   -0.50288835
9   0.4276034   0.24746294
10  1.0697412   0.64229018


The algorithm goes down one table and, for each row, goes up the other table, checks some conditions and modifies values accordingly. More specifically, it goes down B.DT and, for each row in B.DT, goes up A.DT and assigns b1 to a2 at the first row it meets where b1 is smaller than a1. An additional condition is checked before assignment (b2 being less than or equal to 5 in this example).


  • 0.64229018 is the first value in B.DT, and it is assigned to the last unit of A.DT.
  • 1.00527902 is the second value in B.DT, but it is left unassigned because it is bigger than all other values in A.DT.
  • 0.24746294 is the third value in B.DT, and it is assigned to the second-to-last unit in A.DT.
  • -0.50288835 is the fourth value in B.DT, and it is assigned to unit #8 in A.DT.
  • 0.34447791 is the fifth value in B.DT, and it is left unassigned because it is too big.
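The walkthrough above can be sketched as a plain loop over the copy/paste data. This is only one reading of the description: it assumes that rows of A.DT which already hold a value are skipped (the is.na() check), which the narrative implies but the loop in the question does not test. The index names bi and aj are made up for illustration; set() is data.table's by-reference assignment function.

```r
library(data.table)

# Copy/paste data from the question; a2 is created as a numeric NA column
A.DT <- data.table(
  a1 = c(-1.4917779, -1.0731161, -0.7533091, -0.3673273, -0.159569,
         -0.1551948, -0.0430574, 0.1783496, 0.4276034, 1.0697412),
  a2 = NA_real_
)
B.DT <- data.table(
  b1 = c(0.64229018, 1.00527902, 0.24746294, -0.50288835, 0.34447791,
         -0.22205129, 0.60099079, -0.70242284, 0.6298599, 0.08917988),
  b2 = 1:10
)
N <- nrow(B.DT)

for (bi in seq_len(nrow(B.DT))) {          # go down B.DT
  if (B.DT$b2[bi] > N / 2) next            # extra condition: b2 <= N/2
  for (aj in rev(seq_len(nrow(A.DT)))) {   # go up A.DT
    # Assumption: only rows of A.DT that are still NA may receive a value
    if (is.na(A.DT$a2[aj]) && B.DT$b1[bi] < A.DT$a1[aj]) {
      set(A.DT, i = aj, j = "a2", value = B.DT$b1[bi])  # update in place
      break
    }
  }
}
print(A.DT)
```

Under that assumption, rows 8 to 10 of A.DT receive -0.50288835, 0.24746294 and 0.64229018, and the other rows stay NA, which matches the OUTPUT table.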


This is of course a simplified problem (and therefore may not make much sense). Thanks for your time and input.

Recommended answer


Once you have created your data.table, there is little need for the regular assignment operator <-. Instead, use :=, which goes inside the brackets in the j position. The reason for avoiding <- is that it copies the object, whereas := updates it by reference, which is where the efficiency comes from.


So the first modification to your code would be:

 # FROM: A.DT[j,]$a2 <- B.DT[i,]$b1
 # TO: 
 A.DT[j, a2 := B.DT[i, b1] ]
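As a small self-contained sketch of that pattern (the tables DT and other are made up for illustration): the row index goes in i and the assignment goes in j. The target column is created as NA_real_ here so that it is numeric from the start, in line with the as.numeric(NA) remark further down.

```r
library(data.table)

# Hypothetical toy tables, not from the question
DT    <- data.table(x = 1:3, y = NA_real_)
other <- data.table(v = c(0.5, 1.5))

DT[2, y := 10]            # update row 2 of y in place
DT[3, y := other[1, v]]   # update row 3 with a value looked up in another table
print(DT)
```

No result is assigned back to DT: := changes the existing table rather than returning a modified copy that has to be stored.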






Now, one of data.table's (many) best features is its by argument, which helps do away with a lot of for loops and *ply calls. In this specific case, you can clean up your dual loops as follows:

library(data.table)
set.seed(201)
N <- 10  # as in the question
# No need to create a2 up front if it would only hold NA. If you do, make sure it is as.numeric(NA)
A.DT <- data.table(a1 = rnorm(N,0,1), key="a1")
B.DT <- data.table(b1 = rnorm(N,0,1), b2 = 1:N, key="b2")

# Assign to a2 in A.DT
A.DT[            
      , a2 := B.DT[ b2 <= N/2 & b1 < a1] [1, b1]
      , by=a1
     ]


> A.DT
             a1         a2
 1: -2.30403431         NA
 2: -1.69658097         NA
 3: -1.28548252         NA
 4: -0.34454603 -0.6478531
 5: -0.07503189 -0.6478531
 6:  0.05593404 -0.6478531
 7:  0.18900414 -0.6478531
 8:  0.26693735  0.2238094
 9:  0.28606069  0.2238094
10:  0.32576373  0.2238094
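To see what by does with := in isolation, here is a toy sketch (the table and column names are made up): the expression in j is evaluated once per group, and the result is written back to the rows of that group.

```r
library(data.table)

# Hypothetical toy table, not from the question
DT <- data.table(g = c("a", "a", "b"), v = c(1, 2, 5))

# sum(v) is computed once for each distinct value of g
DT[, total := sum(v), by = g]
print(DT)
```

In the answer's code the grouping column is a1, so each group is a single a1 value and the B.DT lookup in j runs once for each of them.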







  • Two side notes on keys:

    • You can set the key at the same time as you create the data.table, saving you two lines of code.
    • A data.table is sorted by its key. Judging by the fact that you are using row position to determine assignment, I am guessing that you will not want to set the keys as you have. In the code above, I changed B.DT's key to b2.
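Both side notes can be checked with a short sketch (the table below is made up for illustration): passing key= at creation sets the key, and the rows come back sorted by it.

```r
library(data.table)

# Hypothetical toy table created with its key in one step
DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"), key = "id")

key(DT)    # "id"
print(DT)  # rows are now ordered by id, not in the order they were typed
```

This is why keying B.DT on b1, as in the question, changes which row is "row i".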