Persisting data in sklearn

Question

I'm using scikit-learn to cluster text documents. I'm using the classes CountVectorizer, TfidfTransformer and MiniBatchKMeans to help me do that. New text documents are added to the system all the time, which means that I need to use the classes above to transform the text and predict a cluster. My question is: how should I store the data on disk? Should I simply pickle the vectorizer, transformer and kmeans objects? Should I just save the data? If so, how do I add it back to the vectorizer, transformer and kmeans objects?

Any help would be appreciated.

Answer

It depends on what you want to do.

If you want to find some fixed cluster centers on a training set and then re-use them later to compute cluster assignments for new data, then pickling the models (or just saving the vocabulary of the vectorizer, the other models' constructor parameters, and the cluster center positions) is fine.
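
A minimal sketch of that route, assuming a CountVectorizer / TfidfTransformer / MiniBatchKMeans pipeline as in the question; the joblib file names and the toy corpus are illustrative, not from the answer:

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

train_docs = ["first training document", "second training document"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)

# Fit once on the training corpus.
counts = vectorizer.fit_transform(train_docs)
tfidf = transformer.fit_transform(counts)
kmeans.fit(tfidf)

# Persist the fitted objects (pickling via joblib).
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(transformer, "transformer.joblib")
joblib.dump(kmeans, "kmeans.joblib")

# Later, in another process: reload and assign clusters to new documents
# without refitting anything.
vectorizer = joblib.load("vectorizer.joblib")
transformer = joblib.load("transformer.joblib")
kmeans = joblib.load("kmeans.joblib")

new_docs = ["a brand new document"]
labels = kmeans.predict(transformer.transform(vectorizer.transform(new_docs)))
```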

If what you want is to re-cluster with the new data included, you might want to retrain the whole pipeline on the union of the new and old data, so that the vectorizer's vocabulary can build new features (dimensions) for the new words and the clustering algorithm can find cluster centers that better match the structure of the complete dataset.
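
A sketch of that retraining route under the same assumptions, where the pipeline is simply refitted on the combined corpus (variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

old_docs = ["old document about cats", "old document about dogs"]
new_docs = ["new document about parrots"]
all_docs = old_docs + new_docs

vectorizer = CountVectorizer()        # rebuilt vocabulary, including new words
transformer = TfidfTransformer()      # refitted IDF weights
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)

# Refit everything on old + new data and get assignments for the full corpus.
tfidf = transformer.fit_transform(vectorizer.fit_transform(all_docs))
labels = kmeans.fit_predict(tfidf)
```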

Note that in the future we will provide hashing vectorizers (see for instance this pull request on hashing transformers as a first building block), hence storing the vocabulary won't be necessary any more (but you will lose the ability to introspect the "meaning" of the feature dimensions).
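
For reference, the stateless hashing vectorizer mentioned here later shipped in scikit-learn as HashingVectorizer; a minimal sketch assuming a current release, where there is no vocabulary to fit or persist for the vectorization step (only the clustering model would still need saving):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit and no stored vocabulary: the hasher is usable immediately,
# at the cost of not being able to map feature indices back to words.
hasher = HashingVectorizer(n_features=2 ** 18)
X = hasher.transform(["some new document"])
```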

As for pickling the models vs. using your own representation for their parameters, I have answered that part in your previous question here: Persist Tf-Idf data
