Persist Tf-Idf data


Problem description


I want to store the TF-IDF matrix so I don't have to recalculate it all the time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or store it in a database?


Some context: I am using k-means clustering to provide document recommendation. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters.
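For concreteness, the setup being persisted looks roughly like this (a toy corpus stands in for the real document collection):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real document collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Fit the vectorizer and compute the TF-IDF matrix
# (a scipy.sparse CSR matrix: one row per document, one column per term).
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
```

It is this sparse matrix, plus the fitted vectorizer's vocabulary and idf weights, that the question is about storing.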

Answer


Pickling (especially using joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.


However, the pickling format depends on the class definitions of the models, which may change from one version of scikit-learn to another.


If you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn, I would recommend writing your own implementation-independent persistence model.
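One way to sketch such an implementation-independent scheme is to persist only plain data — the vocabulary and the idf weights — rather than the pickled object. The attribute names `vocabulary_` and `idf_` are taken from the current scikit-learn API, and the file names are made up:

```python
import json

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer().fit(docs)

# Save plain data only: the term -> column-index mapping as JSON,
# the idf weights as a raw NumPy array.
with open("vocab.json", "w") as f:
    json.dump({term: int(col) for term, col in vectorizer.vocabulary_.items()}, f)
np.save("idf.npy", vectorizer.idf_)

# Reloading depends only on JSON and NumPy, not on how scikit-learn
# happens to lay out its objects internally.
with open("vocab.json") as f:
    vocab = json.load(f)
idf = np.load("idf.npy")
```

Rebuilding a working vectorizer from these pieces (e.g. via `TfidfVectorizer(vocabulary=vocab)` and re-attaching the idf weights) depends on the scikit-learn version at hand, which is exactly why the raw data is the safer thing to store long-term.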


I would also recommend using the HDF5 file format (used in PyTables, for instance) or another database system that has some kind of support for storing numerical arrays efficiently.
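A short sketch of the HDF5 route, using h5py here rather than PyTables (both read and write the same HDF5 format); the array is random stand-in data:

```python
import numpy as np
import h5py

# Stand-in for a numerical feature matrix (e.g. a densified TF-IDF block).
features = np.random.rand(100, 50)

# Write the array to an HDF5 file with transparent gzip compression.
with h5py.File("features.h5", "w") as f:
    f.create_dataset("tfidf", data=features, compression="gzip")

# Read it back; [:] materializes the dataset as a NumPy array.
with h5py.File("features.h5", "r") as f:
    loaded = f["tfidf"][:]
```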


Also have a look at the internal CSR and COO data structures that scipy.sparse uses for sparse matrix representation, to come up with an efficient way to store them in a database.
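A CSR matrix is fully described by three flat arrays plus its shape, which is what makes it straightforward to put in a database. A sketch of decomposing and reconstructing one:

```python
import numpy as np
import scipy.sparse as sp

matrix = sp.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                 [2.0, 0.0, 0.3]]))

# These three arrays (plus matrix.shape) are all that needs to be stored:
#   data    - the nonzero values
#   indices - the column index of each nonzero value
#   indptr  - where each row's values start and end within data/indices
data, indices, indptr = matrix.data, matrix.indices, matrix.indptr

# Reconstruction from the stored components:
restored = sp.csr_matrix((data, indices, indptr), shape=matrix.shape)
```

For file-based storage, SciPy also ships `scipy.sparse.save_npz` / `load_npz`, which persist exactly these components.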
