Using sparse matrices with Keras and Tensorflow

Question

My data can be viewed as a matrix of 10B entries (100M x 100), which is very sparse (< 1/100 * 1/100 of entries are non-zero). I would like to feed the data into a Keras neural network model that I have built, using a Tensorflow backend.

My first thought was to expand the data to be dense, that is, to write out all 10B entries into a series of CSVs, with most entries zero. However, this quickly overwhelmed my resources (even the ETL step overwhelmed pandas, and it is causing Postgres to struggle). So I need to use true sparse matrices.
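
To make the size gap concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes 4-byte float32 values and 4-byte int32 indices (neither is stated in the question) and takes the stated density of 1/100 * 1/100 as an upper bound.

```python
# Rough memory estimate for the matrix described in the question.
# Assumptions (not from the question): float32 values, int32 indices.
rows, cols = 100_000_000, 100
total_entries = rows * cols                 # 10B entries
nnz_upper = total_entries // (100 * 100)    # density < 1/100 * 1/100

value_bytes = 4   # float32
index_bytes = 4   # int32

dense_bytes = total_entries * value_bytes
# COO stores (row, col, value) per non-zero entry.
coo_bytes = nnz_upper * (2 * index_bytes + value_bytes)
# CSR stores (col, value) per non-zero entry plus one row pointer per row.
csr_bytes = nnz_upper * (index_bytes + value_bytes) + (rows + 1) * index_bytes

print(f"non-zeros (upper bound): {nnz_upper:,}")
print(f"dense: {dense_bytes / 1e9:.1f} GB")
print(f"COO:   {coo_bytes / 1e6:.1f} MB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
```

Under these assumptions the CSR row-pointer array dominates, because there are far more rows than non-zero entries. Both sparse layouts are still orders of magnitude smaller than the dense form.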

How can I do that with Keras (and Tensorflow)? While numpy doesn't support sparse matrices, scipy and tensorflow both do. There's lots of discussion (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg ) about this idea: either using scipy's sparse matrices or going directly to Tensorflow's sparse matrices. But I can't find a clear conclusion, and I haven't been able to get anything to work (or even to tell clearly which way to go!).

How can I do this?

I believe there are two possible approaches:

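  1. Keep the data as a scipy sparse matrix, then make each minibatch dense when handing it to Keras
  2. Keep the data sparse all the way through, and use Tensorflow sparse tensors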

I also think #2 is preferred, because you'll get much better performance all the way through (I believe), but #1 is probably easier and will be adequate. I'll be happy with either.
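
For approach #2, one possible starting point is converting a scipy sparse matrix into a Tensorflow sparse tensor. The sketch below is an illustration only and does not come from the linked discussions. The helper names `coo_components` and `to_sparse_tensor` are hypothetical, and the second helper assumes a TensorFlow 2.x installation. Whether a given Keras layer accepts sparse input is a separate question that this sketch does not address.

```python
import numpy as np
import scipy.sparse as sp

def coo_components(X):
    """Return (indices, values, dense_shape) for a scipy sparse matrix.

    indices is an (nnz, 2) int64 array of [row, col] pairs, which is the
    layout tf.sparse.SparseTensor expects.
    """
    coo = sp.coo_matrix(X)
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    return indices, coo.data, coo.shape

def to_sparse_tensor(X):
    """Hypothetical helper: build a tf.sparse.SparseTensor from X."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x
    indices, values, dense_shape = coo_components(X)
    sparse = tf.sparse.SparseTensor(indices, values, dense_shape)
    # Many sparse ops expect row-major index order, so reorder to be safe.
    return tf.sparse.reorder(sparse)

X = sp.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]]))
indices, values, dense_shape = coo_components(X)
print(indices.tolist(), values.tolist(), dense_shape)
```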

How can this be implemented?

Recommended answer

Sorry, I don't have the reputation to comment, but I think you should take a look at the answer here: Keras, sparse matrix issue. I have tried it and it works correctly. One note, though: at least in my case, the shuffling led to really bad results, so I used this slightly modified non-shuffled alternative:

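import numpy as np

def nn_batch_generator(X_data, y_data, batch_size):
    # X_data is a scipy sparse matrix; only one batch at a time is made dense.
    samples_per_epoch = X_data.shape[0]
    # Round up so the final, smaller batch is included.
    number_of_batches = int(np.ceil(samples_per_epoch / batch_size))
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size*counter:batch_size*(counter+1)]
        # toarray() returns a plain ndarray (todense() would return np.matrix).
        X_batch = X_data[index_batch, :].toarray()
        y_batch = y_data[index_batch]
        counter += 1
        yield X_batch, y_batch
        # Restart once every batch has been produced, so no empty batch is yielded.
        if counter >= number_of_batches:
            counter = 0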

It produces comparable accuracies to the ones achieved by keras's shuffled implementation (setting shuffle=True in fit).
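
Because a generator like this never terminates, the training loop has to be told how many batches make up one epoch. The following sketch shows one way to compute that number and checks that one epoch covers every row exactly once. The name `dense_batches` and the toy data are made up for illustration, and the Keras call appears only as a comment because it assumes a compiled model that the answer does not show.

```python
import itertools
import math

import numpy as np
import scipy.sparse as sp

def dense_batches(X, y, batch_size):
    """Endlessly yield (dense X batch, y batch) pairs in row order."""
    n = X.shape[0]
    while True:
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            yield X[start:stop].toarray(), y[start:stop]

# Toy data standing in for the real matrix (illustrative only).
X = sp.random(10, 4, density=0.2, format="csr", random_state=0)
y = np.arange(10)
batch_size = 4

# Number of batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(X.shape[0] / batch_size)

# Take exactly one epoch from the infinite generator.
epoch = list(itertools.islice(dense_batches(X, y, batch_size), steps_per_epoch))
print([x_batch.shape for x_batch, _ in epoch])

# Assumption: with a compiled Keras model named `model`, training might look like
#   model.fit(dense_batches(X, y, batch_size),
#             steps_per_epoch=steps_per_epoch, epochs=5)
```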
