Pickled scipy sparse matrix as input data?


Question

I am working on a multiclass classification problem consisting in classifying resumes.

I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix that I feed into a Tensorflow model after pickling it. On my local machine, I load it, convert a small batch to dense numpy arrays, and fill a feed dictionary. Everything works great.
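A minimal sketch of that local workflow (the corpus and the "features.pkl" path are placeholders, not the asker's actual setup):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus standing in for the resume texts.
corpus = [
    "python machine learning",
    "java backend developer",
    "data analyst sql",
]

# TfidfVectorizer returns a scipy.sparse matrix (CSR format).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Pickle the sparse matrix to disk.
with open("features.pkl", "wb") as f:
    pickle.dump(X, f)

# Later: load it back and densify only a small batch for the feed dict,
# so the full dense matrix never has to fit in memory.
with open("features.pkl", "rb") as f:
    X_loaded = pickle.load(f)

batch = X_loaded[0:2].toarray()  # dense numpy array for just two rows
```

Only the sliced batch is converted with `.toarray()`; the full matrix stays sparse on disk and in memory.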

Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory). I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect that this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.

Also, I read that one must use TFRecords or a CSV format for input data, but I don't understand why my method would not work. CSV is excluded since the dense representation of the matrix would be too big to fit in memory. Can TFRecords encode sparse data like that efficiently? And is it possible to read data from a pickle file?

Answer

You are correct that Python's "open" won't work with GCS out of the box. Given that you're using TensorFlow, you can use the file_io library instead, which will work both with local files as well as files on GCS.

import pickle
from tensorflow.python.lib.io import file_io

# file_io handles both local paths and gs:// URIs.
data = pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))

Note: pickle.load(file_io.FileIO('gs://..', 'r')) does not appear to work.

You are welcome to use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits in memory, then your approach is sensible.

If the data doesn't fit in memory, you will likely want to use TensorFlow's reader framework, the most convenient of which tend to be CSV or TFRecords. TFRecord is simply a container of byte strings. Most commonly, it contains serialized tf.Example data which does support sparse data (it is essentially a map). See tf.parse_example for more information on parsing tf.Example data.
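As a rough sketch of that encoding (the helper name and the dimensions are made up for illustration), each sparse row can be stored in a tf.train.Example as just its nonzero column indices and values, so the record size scales with the number of nonzeros rather than the full dimensionality:

```python
import tensorflow as tf

def sparse_row_to_example(indices, values, dim):
    """Encode one sparse row as a tf.train.Example.

    Stores only the nonzero column indices, their values, and the
    total dimensionality needed to rebuild the row later.
    """
    return tf.train.Example(features=tf.train.Features(feature={
        "indices": tf.train.Feature(int64_list=tf.train.Int64List(value=indices)),
        "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
        "dim": tf.train.Feature(int64_list=tf.train.Int64List(value=[dim])),
    }))

# Hypothetical row: nonzeros at columns 3 and 17 of a 10000-dim tf-idf vector.
example = sparse_row_to_example([3, 17], [0.42, 0.87], 10000)
serialized = example.SerializeToString()  # bytes, ready to write to a TFRecord file
```

The serialized bytes would then be written with a TFRecord writer and parsed back on the input side (e.g. via tf.parse_example with sparse features).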
