Large data quantities in keras model.predict
Problem description
I have a CNN defined like this:
inputs = keras.Input(shape=(1024,1))
x=inputs
# 1st convolutional block
x = keras.layers.Conv1D(16, kernel_size=(3), name='Conv_1')(x)
x = keras.layers.LeakyReLU(0.1)(x)
x = keras.layers.MaxPool1D((2), name='MaxPool_1')(x)
x = keras.layers.Flatten(name='Flatten')(x)
# Classifier
x = keras.layers.Dense(64, name='Dense_1')(x)
x = keras.layers.ReLU(name='ReLU_dense_1')(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Dense(64, name='Dense_2')(x)
x = keras.layers.ReLU(name='ReLU_dense_2')(x)
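I train it in one Google Colab session, and then I open the trained model and use Keras' model.predict(dataarr) to predict results with it.
The problem is that I would like to use large quantities of data for my predictions, but the data is saved in .txt files that become very big (>8 GB), so Google Colab doesn't have enough RAM to open the files and read all of the data into a single array.
What's the best way of handling this? I'm producing the data in C++, and I'm not an expert, but it must be possible to convert the data into binary when I write it out and convert it back when I read it. Is this an intelligent option? Or is there a way of getting Keras to predict in batches, given that each set of 1024 lines in the .txt file is independent of the next set?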
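So what is an input shape?
From the Keras documentation:
shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.
What does it mean? Your input layer keras.Input(shape=(1024,1)) says that each sample you feed in is a sequence of 1024 one-dimensional vectors, so 1024 values. As you correctly understand, there are 1024 neurons in the input layer. A single neuron, however, doesn't work with a sequence of inputs (i.e. lines); it either combines the inputs from neurons in the previous layer with its weights, or takes a single input value. Every further value provided (as from a sequence) is just another independent evaluation. A convolutional layer, however, is a specific type of NN layer: it uses filters to find patterns in the data provided, and it always expects data of the same shape, such as same-sized images or portions of a signal.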
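As a rough illustration of that shape, here is a minimal sketch that reshapes a flat run of values into the (batch, 1024, 1) layout the input layer expects. It assumes every 1024 consecutive values form one sample; the names SAMPLE_LEN and to_model_input are hypothetical and not part of Keras.

```python
import numpy as np

SAMPLE_LEN = 1024  # assumption: matches keras.Input(shape=(1024, 1))


def to_model_input(values):
    """Reshape a flat sequence of values into (batch, SAMPLE_LEN, 1)."""
    arr = np.asarray(values, dtype=np.float32)
    if arr.size % SAMPLE_LEN != 0:
        raise ValueError("number of values must be a multiple of SAMPLE_LEN")
    # -1 lets NumPy infer the batch dimension from the number of values
    return arr.reshape(-1, SAMPLE_LEN, 1)


batch = to_model_input(np.zeros(3 * SAMPLE_LEN))
print(batch.shape)  # (3, 1024, 1)
```

The leading dimension is the batch size, which is why it does not appear in the shape passed to keras.Input.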
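In case you want to provide data with an inconsistent shape, you have two options:
- Split the data into batches that fit the input shape and choose a batch size that fits in your RAM. This can lose information, though, because your data may have a continuity that is broken when it is split.
- Use another kind of neural network suited to sequential data: a recurrent neural network such as an LSTM. These networks take an encoded char/word/value as a single input, process it through the network, and partially memorize the data. LSTM networks are widely used for text classification and do not need statically sized input like most NNs do. If your data has a fixed set of keys (e.g. natural text, source code), you should also consider encoding it with a hash map, if you haven't done so already. This saves space, and it is more natural for an NN to work with numeric data.
As a side note, unless you have an extremely powerful machine, you simply don't want to train/test/execute an NN with such huge data (expecting you have multiple files of this size). The time complexity of training on data of that size is too high, and you might never get your trained model.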
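EDIT: After further explanation from the OP:
The above still applies in general, but not in this case; I am leaving it here as it might be helpful to somebody else.
As for the OP's problem, batch loading should still be applied. RAM won't get any larger, so the dataset needs to be split into chunks. Loading, say, 100 or 1000 lines at once should not take much RAM; you should experiment to find the limits of your machine. You can use the following code to load lines: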
with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
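The file is closed once processing finishes, and the lines are freed from memory by the garbage collector. You can stack the lines in an ndarray and pass them to the predict() method. If you are predicting more than a single sample, you can also pass batch_size to control how many samples are processed at once (it defaults to 32).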
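The idea of stacking lines into an ndarray can be sketched as follows. This is only a sketch under stated assumptions: the file holds one numeric value per line with no blank lines, each sample is 1024 consecutive lines, and iter_batches, SAMPLE_LEN and SAMPLES_PER_BATCH are hypothetical names chosen for illustration.

```python
from itertools import islice

import numpy as np

SAMPLE_LEN = 1024        # lines per sample (assumption from the question)
SAMPLES_PER_BATCH = 100  # assumption: tune this to the available RAM


def iter_batches(path, samples_per_batch=SAMPLES_PER_BATCH, sample_len=SAMPLE_LEN):
    """Yield float32 arrays of shape (n, sample_len, 1) read lazily from a text file."""
    lines_per_batch = samples_per_batch * sample_len
    with open(path) as infile:
        while True:
            # islice pulls at most lines_per_batch lines; the rest stays on disk
            lines = list(islice(infile, lines_per_batch))
            if not lines:
                break
            values = np.array([float(s) for s in lines], dtype=np.float32)
            n_complete = values.size // sample_len
            if n_complete == 0:
                break  # only an incomplete trailing sample is left
            # Keep whole samples only; a partial one can occur only at end of file
            yield values[: n_complete * sample_len].reshape(n_complete, sample_len, 1)
```

Each yielded array could then be passed to model.predict(batch) in a loop, so only one chunk of the file is held in memory at a time.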
EDIT 2: What you really need to do here is load n lines at a time; the technique is described in another thread. You open the file and read it in chunks of n lines. In the example below, run on sample data, I have chosen chunks of 2; you can use whatever number you need, e.g. 1000.
from itertools import zip_longest
import numpy as np

n = 2  # Or whatever chunk size you want
with open("file.txt", 'rb') as f:
    # n references to the same file iterator make zip_longest yield n lines at a time
    for n_lines in zip_longest(*[f]*n, fillvalue=b''):
        arr = np.char.decode(np.array(n_lines), encoding='utf_8')
        print(arr)
The data I used in the sample file is as follows:
1dsds
2sdas
3asdsa
4asdsaad
5asdsaad
6dww
7vcvc
8uku
9kkk1
I have chosen an odd number of lines and a chunk size of 2, so you can see that the last chunk is padded with empty data. The output is the following:
['1dsds\n' '2sdas\n']
['3asdsa\n' '4asdsaad\n']
['5asdsaad\n' '6dww\n']
['7vcvc\n' '8uku\n']
['9kkk1' '']
This code loads 2 lines at a time; you can then remove the newlines if needed with [s.replace('\n', '') for s in arr].
To actually use the loaded data, turn the loop into a generator with yield and iterate over that function:
from itertools import zip_longest
import numpy as np

def batcher(filename: str):
    n = 2  # Or whatever chunk size you want
    with open(filename, 'rb') as f:
        for n_lines in zip_longest(*[f]*n, fillvalue=b''):
            # Drop the b'' padding of the last chunk (it cannot be parsed as a number)
            # and strip the trailing newlines
            lines = [line.strip() for line in n_lines if line.strip()]
            # Decode the loaded byte strings to text
            arr = np.char.decode(np.array(lines), encoding='utf_8')
            # np.float has been removed from NumPy; use an explicit float type
            yield arr.astype(np.float64)

for batch_i, arr in enumerate(batcher("file.txt")):
    out = model.predict(arr.reshape( your_shape_comes_here ))
    # do what you need with the predictions
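One possible way to "do what you need with the predictions" is to collect the per-batch outputs into a single array. The sketch below assumes the predictions themselves fit in memory (they are far smaller than the input here, 64 values per sample). predict_in_batches is a hypothetical helper, and fake_predict only stands in for model.predict so that the example runs without a trained model.

```python
import numpy as np


def predict_in_batches(predict_fn, batches):
    """Apply predict_fn to every batch and concatenate the results along axis 0."""
    outputs = [np.asarray(predict_fn(batch)) for batch in batches]
    if not outputs:
        return np.empty((0,))
    return np.concatenate(outputs, axis=0)


def fake_predict(batch):
    # Stand-in for model.predict: maps (n, 1024, 1) to (n, 1)
    return batch.mean(axis=1)


# Three batches of two samples each, generated lazily
batches = (np.full((2, 1024, 1), i, dtype=np.float32) for i in range(3))
preds = predict_in_batches(fake_predict, batches)
print(preds.shape)  # (6, 1)
```

With a real model, predict_fn would be model.predict and batches would be the generator that reads the file chunk by chunk. If even the predictions are too large to hold, each batch's output could be written to disk instead of being collected.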