Keep TFIDF result for predicting new content using Scikit for Python

Problem description

I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.

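from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")  # iterating the file yields one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)  # 30 clusters
kmresult = km.fit(tfidf).predict(tfidf)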

But when I have new test content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TFIDF for the new test content and make sure its result has the same array length.

Thanks in advance.

Update

I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.
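
On the question of which variable holds the trained IDF: in scikit-learn the learned weights live on the fitted TfidfTransformer (its idf_ attribute), while tfidf is only the transformed training matrix. A minimal sketch, assuming a two-document toy corpus that is not part of the original data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (an assumption for illustration, not the asker's data).
toy_corpus = ["aaa bbb ccc", "aaa bbb ddd"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(toy_corpus))

# The fitted transformer stores one IDF weight per vocabulary term.
print(sorted(vectorizer.vocabulary_))  # terms in column order
print(transformer.idf_)                # learned IDF weights, one per column
print(tfidf.shape)                     # the matrix itself: (n_documents, n_features)
```

So of the two, the transformer object (together with the fitted vectorizer that defines the columns) is the one worth persisting; the tfidf matrix alone cannot transform new text.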

Update

For example, I have training data:

["a", "b", "c"]
["a", "b", "d"]

After TFIDF, the result will contain 4 features (a, b, c, d).

When I test:

["a", "c", "d"]

to see which cluster (already made by k-means) it belongs to, TFIDF will only give a result with 3 features (a, c, d), so the k-means prediction will fail. (If I test ["a", "b", "e"], there may be other problems.)

So how can I store the feature list for use on test data (and, better still, store it in a file)?
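
The mismatch described above can be reproduced in a few lines. This is a sketch under assumptions: the token lists are written as space-separated strings, and a custom token_pattern is passed because CountVectorizer's default pattern drops single-character tokens. It contrasts fitting a fresh vectorizer on the test text with reusing the one fitted on the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Single-letter tokens need a custom token_pattern; the default ignores them.
pattern = r"(?u)\b\w+\b"

train_docs = ["a b c", "a b d"]
vectorizer = CountVectorizer(token_pattern=pattern)
train_counts = vectorizer.fit_transform(train_docs)  # learns vocabulary a, b, c, d

# Fitting a fresh vectorizer on the test text learns only a, c, d.
refit_counts = CountVectorizer(token_pattern=pattern).fit_transform(["a c d"])

# Reusing the fitted vectorizer keeps the training columns.
reused_counts = vectorizer.transform(["a c d", "a b e"])

print(train_counts.shape, refit_counts.shape, reused_counts.shape)
print(reused_counts.toarray())  # "e" is not in the vocabulary, so it is ignored
```

Calling transform on the already fitted vectorizer always yields the training column layout, and terms unseen during training are silently dropped rather than raising an error.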

Update

Solved, see the answer below.

Recommended answer

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it via CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)

# Load it later
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
# Note: fit_transform on the transformer recomputes IDF from the new text alone
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works. tfidf will have the same feature length as the training data.
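
Saving the vocabulary fixes the feature length, but the IDF weights and the trained clusters still have to be kept if new content is to be scored the same way as the training data. One possible approach, which is not the answer's method, is to pickle the fitted objects themselves. The sketch below wraps them in a scikit-learn pipeline; the four-document corpus, n_clusters=2, n_init=10 and random_state=0 are illustrative assumptions only.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy training corpus (an assumption for illustration).
train_docs = ["aaa bbb ccc", "aaa bbb ddd", "xxx yyy zzz", "xxx yyy www"]

# Vectorizer, IDF weighting and clustering fitted together as one object.
model = make_pipeline(
    CountVectorizer(decode_error="replace"),
    TfidfTransformer(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(train_docs)

# Serialize the fitted pipeline; pickle.dump to a file works the same way.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# New content is transformed with the stored vocabulary and IDF weights,
# then assigned to one of the clusters learned during training.
new_docs = ["aaa ccc eee"]
print(restored.predict(new_docs))
```

Because predict only calls transform on the stored steps, nothing is refitted on the new text. A pickle of a fitted estimator should generally be loaded with the same scikit-learn version that wrote it.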
