Large data quantities in keras model.predict


Question

I have a CNN defined like this:

# import assumed; the question calls keras directly (tf.keras or standalone keras both work)
from tensorflow import keras

inputs = keras.Input(shape=(1024, 1))
x = inputs

# 1st convolutional block
x = keras.layers.Conv1D(16, kernel_size=(3), name='Conv_1')(x)
x = keras.layers.LeakyReLU(0.1)(x)
x = keras.layers.MaxPool1D((2), name='MaxPool_1')(x)

x = keras.layers.Flatten(name='Flatten')(x)

# Classifier
x = keras.layers.Dense(64, name='Dense_1')(x)
x = keras.layers.ReLU(name='ReLU_dense_1')(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Dense(64, name='Dense_2')(x)
x = keras.layers.ReLU(name='ReLU_dense_2')(x)


I train it in one Google Colab session, and then I open the trained model and use keras' model.predict(dataarr) to predict results with it.

The problem is that I would like to be able to use large quantities of data for my predictions, but the data is saved in .txt files that become very big (>8GB), and therefore Google Colab doesn't have enough RAM to open the files and read all of the data into a single array.

What's the best way of handling this? I'm producing the data in C++, and I'm not an expert, but it must be possible to convert the data into binary when I write it out and convert it back when I read it. Is this a sensible option? Or is there a way of getting keras to predict in batches, given that each set of 1024 lines in the .txt file is independent from the next set?

Solution

So what is an input shape?

From the Keras documentation:

shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.

What does it mean? Your input layer keras.Input(shape=(1024,1)) says that you are going to feed in vectors of 1024 one-dimensional values, i.e. 1024 values per sample. As you correctly understand, there are 1024 neurons in the input layer. A single neuron, however, doesn't work on a sequence of inputs (i.e. lines); it combines the inputs from the neurons of the previous layer with its weights, or takes a single value as input. Every further value you provide (as from a sequence) is just another independent evaluation. Read more about neurons here. A convolutional layer, however, is a specific type of NN: it uses filters and tries to find patterns in the data provided, always expecting the data to have the same shape, such as same-sized images or portions of a signal.
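
As a concrete illustration, here is a minimal sketch of what that fixed shape means for predict(). The Dense(1) output head and the random data are assumptions added only to make the snippet self-contained; they are not part of the question's model:

import numpy as np
from tensorflow import keras  # assuming the tf.keras API

# the question's input and first convolution, plus a hypothetical output head
inputs = keras.Input(shape=(1024, 1))
x = keras.layers.Conv1D(16, kernel_size=3, name='Conv_1')(inputs)
x = keras.layers.Flatten(name='Flatten')(x)
outputs = keras.layers.Dense(1, name='Out')(x)   # placeholder head for illustration
model = keras.Model(inputs, outputs)

# predict() expects a batch of fixed-shape samples: (batch_size, 1024, 1)
batch = np.random.rand(32, 1024, 1)
print(model.predict(batch).shape)                # -> (32, 1)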

In case you want to provide data with inconsistent shape, you have two options:

  1. Split the data into batches that fit the input shape and choose a reasonable batch size that fits into your RAM; this might however lead to information loss, since your data might have continuity that is lost when split
  2. Use another type of neural network suited to sequential data - recurrent neural networks such as LSTMs (a minimal sketch follows this list). These networks take an encoded char/word/value as a single input and process it through the network while partially memorizing the data. LSTM nets are widely used for text classification and do not require a statically sized input the way most NNs do. If you work with data that has a set of keys, such as natural text, source code etc., you should also think about encoding it through a hash map (if not already done). You save space, and it is far more intuitive for an NN to work with numerical data.
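
A minimal sketch of option 2, assuming a recurrent classifier; the LSTM size and the sigmoid output head are illustrative choices, not something taken from the question:

from tensorflow import keras  # assuming the tf.keras API

# variable-length sequences of single values: note the None in the input shape
inputs = keras.Input(shape=(None, 1))
x = keras.layers.LSTM(64)(inputs)                         # hypothetical layer size
outputs = keras.layers.Dense(1, activation='sigmoid')(x)  # placeholder binary head
lstm_model = keras.Model(inputs, outputs)
lstm_model.summary()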

As a side note, unless you have an extremely powerful machine, you simply don't want to train/test/execute an NN on data this huge (assuming you have multiple files of this size); the time complexity of training on such data is far too high, and you might never get your trained model.

EDIT: After further explanation from the OP:

The above still applies, but not to this case; I am leaving it there as it might be helpful to somebody else.

Regarding the OP's problem, batch loading should still be applied. The RAM won't get any bigger, so the dataset needs to be split into chunks. Loading e.g. 100 or 1000 lines at once should not strain the RAM too much - you should experiment to find the limits of your machine. You can use the following code to load the lines:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

The file will be closed after it has been processed, and the lines will be freed from memory by the garbage collector. You can stack the lines in an ndarray and pass them to the predict() method; you also need to provide batch_size if you are not predicting a single sample, as sketched below.
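
For instance, a hedged sketch of that idea; the file name "data.txt", the chunk size, and the assumption of one float per line (with every 1024 lines forming one sample, as stated in the question) are placeholders, and model is the trained model from the question:

import numpy as np

lines_per_sample = 1024          # one sample = 1024 lines, per the question
samples_per_chunk = 100          # tune this to fit your RAM

buffer = []
with open("data.txt") as infile:
    for line in infile:
        buffer.append(float(line))
        if len(buffer) == samples_per_chunk * lines_per_sample:
            batch = np.array(buffer).reshape(-1, lines_per_sample, 1)
            preds = model.predict(batch, batch_size=32)
            # ... accumulate or save preds here ...
            buffer = []

if buffer:                       # last, possibly smaller, chunk
    batch = np.array(buffer).reshape(-1, lines_per_sample, 1)
    preds = model.predict(batch, batch_size=32)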

EDIT 2:

What you really need to do here is to load n lines at a time; the thread where this is done is here. You open the file and load it in chunks of n lines. In the example below, run on some sample data, I have chosen chunks of 2, but you can use whatever number you need, e.g. 1000.

from itertools import zip_longest
import numpy as np

n = 2  # Or whatever chunk size you want
with open("file.txt", 'rb') as f:
    for n_lines in zip_longest(*[f] * n, fillvalue=b''):
        arr = np.char.decode(np.array(n_lines), encoding='utf_8')
        print(arr)

The data I have used in the sample file is as follows:

1dsds
2sdas
3asdsa
4asdsaad
5asdsaad
6dww
7vcvc
8uku
9kkk1

I have chosen an odd number of lines and a chunk size of 2, so you can see that the last chunk is padded with empty data; the output is as follows:

['1dsds\n' '2sdas\n']
['3asdsa\n' '4asdsaad\n']
['5asdsaad\n' '6dww\n']
['7vcvc\n' '8uku\n']
['9kkk1' '']

This code loads 2 lines at a time; you can then strip the newlines, if needed, with [s.replace('\n', '') for s in arr].

To make use of the returned data, use yield and iterate over this function:

from itertools import zip_longest
import numpy as np

def batcher(filename: str):
    n = 2  # Or whatever chunk size you want
    with open(filename, 'rb') as f:
        for n_lines in zip_longest(*[f] * n, fillvalue=b''):
            # decode the loaded byte arrays to strings
            arr = np.char.decode(np.array(n_lines), encoding='utf_8')
            arr = arr[arr != '']      # drop the padding added to the last, partial chunk
            yield arr.astype(float)   # np.float is deprecated in newer NumPy; plain float works

for batch_i, arr in enumerate(batcher("file.txt")):
    out = model.predict(arr.reshape( your_shape_comes_here ))
    # do what you need with the predictions
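
Tying this back to the question, a hedged adaptation of that loop for the OP's data: it assumes one value per line and n = 1024 inside batcher(), so that each chunk is exactly one sample of shape (1024, 1); "data.txt" is a placeholder file name:

predictions = []
for batch_i, arr in enumerate(batcher("data.txt")):
    out = model.predict(arr.reshape(1, 1024, 1))   # a single sample of shape (1024, 1)
    predictions.append(out)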
