Using sparse matrices with Keras and TensorFlow

Question

My data can be viewed as a matrix of 10B entries (100M x 100), which is very sparse (< 1/100 * 1/100 of entries are non-zero). I would like to feed the data into a Keras neural network model that I have built, using a TensorFlow backend.

My first thought was to expand the data to dense form, that is, to write out all 10B entries into a series of CSVs, with most entries zero. However, this quickly overwhelmed my resources (even the ETL step overwhelmed pandas, and it is causing postgres to struggle). So I need to use true sparse matrices.
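To put rough numbers on why the dense route is so heavy, here is a back-of-the-envelope estimate. The 4-byte value and index sizes (float32 and int32) are assumptions, not something stated in the question, and the density is taken at its stated upper bound.

```python
# Rough storage estimate for a 100M x 100 matrix; dtype sizes are assumptions.
ROWS, COLS = 100_000_000, 100
BYTES_VALUE = 4   # assumed float32 values
BYTES_INDEX = 4   # assumed int32 indices

entries = ROWS * COLS
nnz = entries // 10_000   # upper bound from the question: 1/100 * 1/100

dense_bytes = entries * BYTES_VALUE
# COO keeps (row, col, value) for every non-zero entry.
coo_bytes = nnz * (2 * BYTES_INDEX + BYTES_VALUE)
# CSR keeps (col, value) per non-zero plus one row pointer per row (+1).
csr_bytes = nnz * (BYTES_INDEX + BYTES_VALUE) + (ROWS + 1) * BYTES_INDEX

for name, size in [("dense", dense_bytes), ("COO", coo_bytes), ("CSR", csr_bytes)]:
    print(f"{name:>5}: {size / 1e9:.3f} GB")
```

Under these assumptions the dense array needs about 40 GB, while the non-zero entries themselves fit in roughly 12 MB as COO triples. In CSR the row-pointer array dominates because of the 100M rows, but the total is still around 0.4 GB.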

How can I do that with Keras (and TensorFlow)? While numpy doesn't support sparse matrices, scipy and tensorflow both do. There's lots of discussion (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg ) about this idea, either using scipy's sparse matrices or going directly to TensorFlow's sparse matrices. But I can't find a clear conclusion, and I haven't been able to get anything to work (or even know clearly which way to go!).

How should I go about this?

I believe there are two possible approaches:

  1. Keep it as a scipy sparse matrix, then densify each minibatch as it is handed to Keras
  2. Keep it sparse all the way through, and use TensorFlow sparse tensors

I also think #2 is preferable, because you'd get much better performance all the way through (I believe), but #1 is probably easier and would be adequate. I'll be happy with either.
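As a minimal sketch of approach #1, assuming the non-zero entries are available as (row, column, value) triples: the toy arrays below are hypothetical stand-ins for the real data. The matrix is built once in CSR format, and only one slice of rows is densified at a time.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: (row, col, value) triples for the non-zero entries.
rows = np.array([0, 2, 2, 5])
cols = np.array([1, 0, 3, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# CSR supports efficient row slicing, which is what minibatching needs.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(6, 4))

# Densify only the rows of one minibatch; X itself stays sparse.
batch = X[0:3].toarray()
print(batch.shape)   # (3, 4)
print(X.nnz)         # 4
```

The memory cost of the dense part is then bounded by the batch size times the number of columns, regardless of how many rows the full matrix has.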

How can this be implemented?

Recommended answer

Sorry, I don't have the reputation to comment, but I think you should take a look at the answer here: Keras, sparse matrix issue. I have tried it and it works correctly. One note, though: at least in my case, the shuffling led to really bad results, so I used this slightly modified non-shuffled alternative:

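import numpy as np

def nn_batch_generator(X_data, y_data, batch_size):
    """Yield dense (X, y) minibatches from a scipy sparse matrix, in order, forever."""
    samples_per_epoch = X_data.shape[0]
    # Ceiling division so the final, possibly smaller, batch is counted.
    number_of_batches = int(np.ceil(samples_per_epoch / batch_size))
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        # Densify only the current batch; the full matrix stays sparse.
        X_batch = X_data[index_batch, :].toarray()
        y_batch = y_data[index_batch]
        counter += 1
        yield X_batch, y_batch
        # Wrap around after the last batch (>= avoids yielding an empty batch).
        if counter >= number_of_batches:
            counter = 0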

It produces accuracies comparable to those achieved by Keras's shuffled implementation (setting shuffle=True in fit).
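For anyone who wants to compare against shuffled batches directly, the following is a hypothetical variant that draws a fresh row permutation at the start of every epoch. It is a sketch written for illustration, not the code from the linked answer; the name shuffled_batch_generator and the seed parameter are assumptions.

```python
import numpy as np
from scipy import sparse

def shuffled_batch_generator(X_data, y_data, batch_size, seed=0):
    """Hypothetical variant: yield dense minibatches, reshuffling rows each epoch."""
    n_samples = X_data.shape[0]
    rng = np.random.default_rng(seed)
    while True:
        index = rng.permutation(n_samples)   # new row order every epoch
        for start in range(0, n_samples, batch_size):
            index_batch = index[start:start + batch_size]
            # Only the selected rows are densified.
            yield X_data[index_batch, :].toarray(), y_data[index_batch]

# Toy check: 5 samples, batch size 2 -> batches of 2, 2 and 1 rows per epoch.
X_demo = sparse.random(5, 3, density=0.4, format="csr", random_state=0)
y_demo = np.arange(5)
gen = shuffled_batch_generator(X_demo, y_demo, batch_size=2)
epoch_batches = [next(gen) for _ in range(3)]
print([xb.shape for xb, _ in epoch_batches])
```

Either generator runs forever, so the training loop has to be told how many batches make up an epoch. With Keras that is usually the steps_per_epoch argument, set to the sample count divided by the batch size and rounded up.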
