How to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

Question

TfidfVectorizer provides an easy way to encode and transform texts into vectors.

My question is: how do I choose proper values for parameters such as min_df, max_features, smooth_idf, and sublinear_tf?

Update:

Maybe I should have put more details in the question:

What if I am doing unsupervised clustering on a bunch of texts? I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out).

Recommended answer

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and, of course, the parameters of the classifier as well) and see which values give you the best performance.

You can do that easily in sklearn with the GridSearchCV and Pipeline objects:

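from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# stop_words, train_x and train_y are assumed to be defined elsewhere.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
# Keys follow the '<step name>__<parameter>' convention of Pipeline.
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)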
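
The grid search above needs labels (train_y), which the question says are not available. One possible adaptation, offered only as a sketch and not as part of the original answer, is to loop over TfidfVectorizer settings and candidate cluster counts, cluster each result, and compare an internal metric such as the silhouette score. The toy corpus, the helper name score_settings, and the choice of KMeans with cosine silhouette are all assumptions made for illustration; other clusterers or metrics may suit real data better.

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus, made up for illustration only.
texts = [
    "the cat sat on the mat",
    "the cat and the kitten like to play",
    "my cat chased the kitten",
    "a kitten is a young cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "the stock markets rallied after the report",
    "markets and stock prices recovered this week",
]


def score_settings(texts, min_dfs, sublinear_tfs, ks, random_state=0):
    """Hypothetical helper: score every (min_df, sublinear_tf, k) combination.

    Returns a list of (silhouette, min_df, sublinear_tf, k) tuples,
    sorted with the highest silhouette score first.
    """
    results = []
    for min_df, sublinear_tf, k in product(min_dfs, sublinear_tfs, ks):
        vectorizer = TfidfVectorizer(min_df=min_df, sublinear_tf=sublinear_tf)
        X = vectorizer.fit_transform(texts)
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X)
        # Cosine distance is a common choice for TF-IDF vectors.
        score = silhouette_score(X, labels, metric="cosine")
        results.append((score, min_df, sublinear_tf, k))
    return sorted(results, reverse=True)


results = score_settings(
    texts, min_dfs=(1, 2), sublinear_tfs=(False, True), ks=(2, 3)
)
for score, min_df, sublinear_tf, k in results[:3]:
    print(f"silhouette={score:.3f} min_df={min_df} "
          f"sublinear_tf={sublinear_tf} k={k}")
```

Note that the silhouette score is only a heuristic here: changing the vectorizer changes the feature space itself, so scores across different settings are not strictly comparable, and the top-ranked combination is worth checking by inspecting the resulting clusters.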
