'Pipeline' object has no attribute 'get_feature_names' in scikit-learn


Problem description

I am clustering some of my documents with the mini_batch_kmeans and kmeans algorithms. I simply followed the tutorial on the scikit-learn website: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

The tutorial uses several vectorization methods, one of which is HashingVectorizer. For HashingVectorizer, it builds a pipeline with TfidfTransformer():

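from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Perform an IDF normalization on the output of HashingVectorizer
# (opts.n_features comes from the tutorial's command-line options;
# older scikit-learn releases spelled alternate_sign=False as non_negative=True)
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', alternate_sign=False,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())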

The vectorizer I get back from this does not have a get_feature_names() method. But since I am using it for clustering, I need get_feature_names() to get the "terms":

terms = vectorizer.get_feature_names()  # AttributeError: 'Pipeline' object has no attribute 'get_feature_names'
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

How do I solve this error?

My whole code is shown below:

X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames,
                                        contents=_contents, is_dimension_reduced=False)

The count vectorizer piped with tfidf:

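from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

def count_tfidf_vectorizer(self, contents):
    # Bag-of-words counts followed by TF-IDF weighting, wrapped in a Pipeline
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect, TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer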

And the mini_batch_kmeans class is as follows:

from time import time
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ",cluster_labels)
        data = {'filename':filenames, 'contents':contents, 'cluster_label':cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True,ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")

        if is_dimension_reduced:
            if svd is not None:
                original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
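            # .ix was removed in pandas 1.0; select this cluster's rows with a boolean mask
            for file in frame.loc[frame['cluster_label'] == i, 'filename'].tolist():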
                print(' %s,' % file, end='')
            print()

Recommended answer

Pipeline doesn't have a get_feature_names() method, because implementing it for Pipeline is not straightforward: one needs to consider all pipeline steps to get the feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc.; there are many related tickets and several attempts to fix it.

If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans), you can get the feature names from the TfidfVectorizer step.
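
The same idea can be sketched for the CountVectorizer + TfidfTransformer pipeline from the question. make_pipeline names each step after its lowercased class name, so the fitted CountVectorizer can be looked up through named_steps. The toy documents are made up for illustration, and the hasattr check is only there because newer scikit-learn releases replaced get_feature_names() with get_feature_names_out():

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for this sketch
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

vectorizer = make_pipeline(CountVectorizer(), TfidfTransformer())
X = vectorizer.fit_transform(docs)

# Ask the fitted CountVectorizer step for the terms, not the Pipeline itself
count_vect = vectorizer.named_steps["countvectorizer"]
if hasattr(count_vect, "get_feature_names_out"):
    terms = list(count_vect.get_feature_names_out())  # newer scikit-learn
else:
    terms = count_vect.get_feature_names()            # older scikit-learn

# TfidfTransformer only reweights columns, so terms[j] still labels column j of X
print(len(terms), X.shape[1])
```

In the question's mini_kmeans_technique, this would mean building terms from the CountVectorizer step instead of calling get_feature_names() on the pipeline.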

If you want to use HashingVectorizer, it is more complicated, because HashingVectorizer doesn't provide feature names by design. It doesn't store a vocabulary and uses hashes instead. This means it can be applied in an online setting and needs no memory for a vocabulary, but the tradeoff is exactly that you don't get feature names.
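
A small sketch of that difference, on made-up documents: a fitted CountVectorizer carries a vocabulary_ mapping, while HashingVectorizer can transform text without being fitted and keeps no such mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Toy corpus, made up for this sketch
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

count_vect = CountVectorizer().fit(docs)
hasher = HashingVectorizer(n_features=2**8)

# Stateless: transform works without any prior fit
X_hashed = hasher.transform(docs)

count_has_vocab = hasattr(count_vect, "vocabulary_")  # term -> column mapping is stored
hasher_has_vocab = hasattr(hasher, "vocabulary_")     # nothing is stored

print(count_has_vocab, hasher_has_vocab, X_hashed.shape)
```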

It is still possible to get feature names from HashingVectorizer, though. To do this, you apply it to a sample of documents, store which hashes correspond to which words, and in this way learn what the hashes mean, i.e. what the feature names are. There may be collisions, so you cannot be 100% sure a feature name is correct, but this approach usually works fine. It is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:

from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec)  # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()

Then you can use hashing_feat_names as your feature names, because TfidfTransformer doesn't change the size of the input vectors; it just rescales the same features.
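
For readers who want to see the underlying idea without eli5, here is a rough hand-rolled sketch: hash every token from a sample and remember which column it lands in. It is not eli5's implementation. The helper name recover_hashed_terms and the toy documents are hypothetical, and the sketch assumes the default word-level unigram analyzer, so that re-hashing a single token yields exactly one column.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import HashingVectorizer


def recover_hashed_terms(hasher, sample_docs):
    """Map each hashed column index to the sample tokens that land in it."""
    analyze = hasher.build_analyzer()
    tokens = sorted({tok for doc in sample_docs for tok in analyze(doc)})
    # Hash each token as its own one-word document: one CSR row per token
    hashed = hasher.transform(tokens)
    col_to_terms = defaultdict(list)
    for row, tok in enumerate(tokens):
        start, end = hashed.indptr[row], hashed.indptr[row + 1]
        for col in hashed.indices[start:end]:
            # Several tokens may share a column: that is a hash collision
            col_to_terms[int(col)].append(tok)
    return dict(col_to_terms)


# Toy corpus, made up for this sketch
docs = ["the cat sat on the mat", "the dog chased the cat"]
hasher = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
col_to_terms = recover_hashed_terms(hasher, docs)

for col, names in sorted(col_to_terms.items()):
    print(col, names)
```

When printing the top terms per cluster, a column index ind from order_centroids could then be looked up with col_to_terms.get(ind), keeping in mind that columns never seen in the sample have no name.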
