Keep TFIDF result for predicting new content using Scikit for Python


Question

I am using sklearn in Python to do some clustering. I have trained on 200,000 documents, and the code below works well.

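from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=30)
kmresult = km.fit(tfidf).predict(tfidf)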

But when I have new testing content, I'd like to assign it to the existing clusters I have already trained. So I'm wondering how to save the IDF result, so that I can compute TFIDF for the new testing content and make sure its result has the same array length as the training data.

Thanks.

Update

I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.
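
As a minimal sketch of where the trained IDF lives (assuming a reasonably recent scikit-learn): the fitted TfidfTransformer exposes the learned weights as its idf_ attribute, while tfidf is only the transformed training matrix. The toy corpus and the variable names idf_weights and tfidf_shape are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, purely illustrative (not the asker's data)
toy_corpus = ["aaa bbb ccc", "aaa bbb ddd"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# The learned IDF weights: one value per vocabulary term
idf_weights = transformer.idf_
# The TF-IDF matrix of the training documents: one row per document
tfidf_shape = tfidf.shape
print(len(idf_weights), tfidf_shape)
```

So of the two variables, "transformer" is the one that carries the IDF; "tfidf" holds only the already-weighted training rows.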

Update

For example, I have the training data:

["a", "b", "c"]
["a", "b", "d"]

Then I run TFIDF, and the result will contain 4 features (a, b, c, d).

When I test:

["a", "c", "d"]

to see which cluster (already built by k-means) it belongs to, TFIDF will only give a result with 3 features (a, c, d), so the k-means clustering will fail. (If I test ["a", "b", "e"], there may be other problems.)
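
The mismatch described here can be reproduced with a small sketch. It assumes the default CountVectorizer settings, whose token pattern ignores single-character tokens, so the toy documents use three-letter tokens instead of "a", "b", "c". All variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three-letter tokens stand in for "a", "b", "c", "d" because the
# default token pattern skips single-character tokens
train_docs = ["aaa bbb ccc", "aaa bbb ddd"]
test_docs = ["aaa ccc ddd"]

# Vocabulary learned from the training data: aaa, bbb, ccc, ddd
train_counts = CountVectorizer().fit_transform(train_docs)

# A fresh vectorizer fitted on the test text learns only aaa, ccc, ddd
fresh_counts = CountVectorizer().fit_transform(test_docs)

n_train_features = train_counts.shape[1]
n_fresh_features = fresh_counts.shape[1]
print(n_train_features, n_fresh_features)
```

The two matrices have different column counts, which is why a k-means model fitted on the training matrix cannot accept the freshly vectorized test row.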

So how can I store the feature list for use with testing data (and, better still, store it in a file)?

Update

Solved, see the answer below.

Recommended answer

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

The code is below:

import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_, open("feature.pkl", "wb"))

# Load it later
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(open("feature.pkl", "rb")))
# Note: fit_transform on the transformer recomputes the IDF weights from the new text only
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works. tfidf will have the same feature length as the training data.
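
Saving the vocabulary fixes the feature length, but a TfidfTransformer that is fitted again on the new text learns its IDF weights from that text alone. If the goal is to keep the trained IDF and assign new content to the existing clusters, one possible approach is to pickle the fitted objects themselves and call transform rather than fit_transform when loading. The following is only a sketch under assumptions: the toy corpus, the file name tfidf_kmeans.pkl, the cluster count, and the variable names are all hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training corpus, purely illustrative
train_docs = ["aaa bbb ccc", "aaa bbb ddd", "ccc ddd eee", "ddd eee fff"]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
train_tfidf = transformer.fit_transform(vectorizer.fit_transform(train_docs))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_tfidf)

# Persist the three fitted objects together (hypothetical file name)
model_path = os.path.join(tempfile.gettempdir(), "tfidf_kmeans.pkl")
with open(model_path, "wb") as f:
    pickle.dump((vectorizer, transformer, km), f)

# Later: load the fitted objects and only transform the new text,
# so the vocabulary and the trained IDF weights are reused unchanged
with open(model_path, "rb") as f:
    loaded_vec, loaded_tfidf, loaded_km = pickle.load(f)

# "zzz" is not in the training vocabulary and is simply ignored
new_tfidf = loaded_tfidf.transform(loaded_vec.transform(["aaa ccc zzz"]))
new_cluster = loaded_km.predict(new_tfidf)
print(new_tfidf.shape, new_cluster)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that wrote them, and pickle files should only be loaded from trusted sources.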
