What is the best way to handle large data with Tensorflow.js and tf.Tensor?

Question

I am using tf.Tensor and tf.concat() to handle large training data, and I found that repeated use of tf.concat() gets slow. What is the best way to load large data from a file into a tf.Tensor?

I think the common way to handle data in JavaScript is with an array. To achieve that, here are the rough steps (a sketch follows the list):


  1. read a line from the file
  2. parse the line into a JavaScript object
  3. add that object to an array with Array.push()
  4. after reading the last line, use that array with a for loop
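A rough Node.js sketch of those steps, assuming a newline-delimited JSON file; the loadToArray name and the file format are illustrative, not from the original question:

const fs = require("fs");
const readline = require("readline");

// Read a file line by line, parse each line, and collect the objects.
async function loadToArray(path) {
  const rl = readline.createInterface({input: fs.createReadStream(path)});
  const arr = [];
  for await (const line of rl) {
    arr.push(JSON.parse(line)); // steps 1-3: read, parse, push
  }
  return arr; // step 4: iterate over it with a for loop
}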

So I think I can use tf.concat() in a similar way to the above (a sketch follows the list):


  1. read a line from the file
  2. parse the line into a JavaScript object
  3. parse the object into a tf.Tensor
  4. add that tensor to the accumulated tensor with tf.concat()
  5. after reading the last line, use that tf.Tensor
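The tensor-accumulation variant might look like this sketch, where rl is a line reader as in the sketch above and parseLine is a hypothetical parser that returns one row of numbers per line:

// A minimal sketch of accumulating rows with tf.concat().
async function loadToTensor(rl) {
  let t = null;
  for await (const line of rl) {
    const row = tf.tensor2d([parseLine(line)]); // one row tensor per line
    if (t === null) {
      t = row;
    } else {
      const old = t;
      t = tf.concat([old, row]); // tf.concat builds a brand-new tensor every call
      old.dispose(); // dispose the superseded tensor to avoid leaking it
      row.dispose();
    }
  }
  return t;
}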



Some code



Here is some code to measure the speed of both Array.push() and tf.concat():

import * as tf from "@tensorflow/tfjs"

// Measure tf.concat(): grow a tensor by one element per iteration.
let t = tf.tensor1d([1])
let addT = tf.tensor1d([2])

console.time()
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 == 0) {
        console.timeEnd()
        console.time()
        console.log(idx)
    }
    // note: tf.tidy keeps the returned tensor, but the previous `t`
    // (created outside the tidy) is never disposed, so tensors accumulate
    t = tf.tidy(() => t.concat(addT))
}

// Measure Array.push(): grow a plain array by one element per iteration.
let arr = []
let addA = 1
console.time()
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 == 0) {
        console.timeEnd()
        console.time()
        console.log(idx)
    }
    arr.push(addA)
}



Measurements



We can see a stable time per iteration for Array.push(), but the time keeps growing for tf.concat().

For tf.concat():

default: 0.150ms
0
default: 68.725ms
1000
default: 62.922ms
2000
default: 23.199ms
3000
default: 21.093ms
4000
default: 27.808ms
5000
default: 39.689ms
6000
default: 34.798ms
7000
default: 45.502ms
8000
default: 94.526ms
9000
default: 51.996ms
10000
default: 76.529ms
11000
default: 83.662ms
12000
default: 45.730ms
13000
default: 89.119ms
14000
default: 49.171ms
15000
default: 48.555ms
16000
default: 55.686ms
17000
default: 54.857ms
18000
default: 54.801ms
19000
default: 55.312ms
20000
default: 65.760ms






For Array.push():

default: 0.009ms
0
default: 0.388ms
1000
default: 0.340ms
2000
default: 0.333ms
3000
default: 0.317ms
4000
default: 0.330ms
5000
default: 0.289ms
6000
default: 0.299ms
7000
default: 0.291ms
8000
default: 0.320ms
9000
default: 0.284ms
10000
default: 0.343ms
11000
default: 0.327ms
12000
default: 0.317ms
13000
default: 0.329ms
14000
default: 0.307ms
15000
default: 0.218ms
16000
default: 0.193ms
17000
default: 0.234ms
18000
default: 1.943ms
19000
default: 0.164ms
20000
default: 0.148ms


Answer

Though there is not a single way of creating a tensor, the answer lies in what is done with the tensors once they are created.

Tensors are immutable, therefore each time tf.concat is called, a new tensor is created.

let x = tf.tensor1d([2]);
console.log(tf.memory()) // "numTensors": 1
const y = tf.tensor1d([3])
x = tf.concat([x, y])
console.log(tf.memory()) // "numTensors": 3

<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.14.1"> </script>
  </head>

  <body>
  </body>
</html>

As we can see from the snippet above, the number of tensors in memory after tf.concat is called is 3 and not 2. It is true that tf.tidy will dispose of unused tensors, but this cycle of creating and disposing tensors becomes more and more costly as the concatenated tensor gets bigger and bigger. This is an issue of both memory consumption and computation, since creating a new tensor will always delegate to a backend.
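An alternative to tf.tidy, shown here as a minimal sketch extending the snippet above, is to dispose of the superseded tensor explicitly; the variable names are illustrative:

let x = tf.tensor1d([2]);
const y = tf.tensor1d([3]);

const old = x;
x = tf.concat([x, y]); // the new, concatenated tensor
old.dispose(); // release the superseded tensor
y.dispose();
console.log(tf.memory()); // "numTensors": 1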

Now that the issue of performance is understood, what is the best way to proceed?


  • Create the whole array in JS, and create the tensor only once the whole array is complete:

const x = []
for (let i = 0; i < data.length; i++) {
  // fill array x with parsed values
  x.push(dataValue)
}
// create the tensor once the array is complete
tf.tensor(x)

Though it is the trivial solution, it is not always possible: creating the array keeps all the data in memory, and we can easily run out of memory with big data entries. Therefore it might sometimes be best, instead of creating the whole JavaScript array, to create chunks of arrays, create a tensor from each chunk, and start processing those tensors as soon as they are created. The chunk tensors can be merged using tf.concat again if necessary, but that might not always be required.

For instance, we can call model.fit() repeatedly with chunks of tensors instead of calling it once with a big tensor that might take long to create. In this case, there is no need to concatenate the chunk tensors.
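A hedged sketch of that chunked training loop, assuming an already-compiled tf.LayersModel named model and a hypothetical readChunks() helper that yields arrays of parsed rows; the loop must run inside an async function:

// `readChunks()` and the row format are assumptions, not from the answer.
for (const chunk of readChunks()) {
  const xs = tf.tensor2d(chunk.map(row => row.features));
  const ys = tf.tensor2d(chunk.map(row => row.labels));
  await model.fit(xs, ys, {epochs: 1}); // train on this chunk only
  // free the chunk tensors before loading the next chunk
  xs.dispose();
  ys.dispose();
}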


  • Use tf.data to create a dataset if possible. This is the ideal solution if we are next going to fit a model with the data:

function makeIterator() {
  let index = 0; // track the read position across next() calls
  const iterator = {
    next: () => {
      let result;
      if (index < data.length) {
        result = {value: data[index], done: false};
        index++;
        return result;
      }
      return {value: null, done: true};
    }
  };
  return iterator;
}
const ds = tf.data.generator(makeIterator);

The advantage of using tf.data is that the dataset is created in batches, as needed, during the model.fit call.
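For completeness, a sketch of consuming such a dataset with model.fitDataset; the batch size and epoch count are illustrative, model is assumed to be compiled already, and the call must run inside an async function:

// Batch the dataset and train directly from it; batches are
// materialized lazily while training runs.
const batched = ds.batch(32);
await model.fitDataset(batched, {epochs: 5});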

