Is it possible to apply PCA to text classification?

Question

I'm trying to do classification with Python. I'm using the Naive Bayes MultinomialNB classifier on web pages: I retrieve data from the web as text, then classify that text (web classification).

Now I'm trying to apply PCA to this data, but Python raises errors.

My code for classification with Naive Bayes:

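# PCA lives in sklearn.decomposition, not in the top-level sklearn package.
# RandomizedPCA has been removed from scikit-learn; PCA(svd_solver='randomized') replaces it.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)

# temizdata holds the page texts, y_train the matching labels
x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)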

The training matrix used for this Naive Bayes classification looks like this:

>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
    with 6302 stored elements in Compressed Sparse Row format>

>>> print(x_train)
(0, 2966)   1
(0, 1974)   1
(0, 3296)   1
..
..
(42, 1629)  1
(42, 2833)  1
(42, 876)   1

Then I try to apply PCA to my data (temizdata):

>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA.fit_transform(v_temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)

But this raises an error:

    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I converted the matrix to a dense matrix (NumPy array) and then tried to classify the new dense matrix, but I got another error.

My main aim is to test the effect of PCA on text classification.

Converting to a dense array:

v_temizdatatodense = v_temizdata.toarray()  # plain ndarray, as the error message suggests
pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, I try to classify:

classifer.fit(pca_t,y_train)

The classification then fails with:

    raise ValueError("Input X must be non-negative")
ValueError: Input X must be non-negative

In short, I want to compare two setups: in one, my data (temizdata) goes into Naive Bayes directly; in the other, temizdata first goes through PCA (to reduce the number of inputs) and is then classified.

Recommended answer

Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

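from sklearn.decomposition import TruncatedSVD

# data is the sparse matrix returned by the vectorizer
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)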
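As a self-contained illustration of that call, here is a minimal sketch on a made-up toy corpus. The documents and the choice of `n_components=2` are assumptions for the example, not the asker's data; the corpus is too small for 5 components to be meaningful.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the asker's temizdata
docs = [
    "cheap flights and hotel deals",
    "hotel booking and cheap travel",
    "travel deals for summer flights",
    "python machine learning tutorial",
    "learning python for data analysis",
    "data analysis with machine learning",
]

# CountVectorizer returns a SciPy sparse matrix of term counts
X = CountVectorizer().fit_transform(docs)

# TruncatedSVD accepts the sparse matrix directly; no toarray() call is needed
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)

print(sparse.issparse(X))   # input stays sparse
print(X_reduced.shape)      # one row per document, one column per component
```

The result is a dense NumPy array with one row per document, which is usually small enough to hold in memory because the number of components is far below the vocabulary size.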

And, quoting the TruncatedSVD documentation:

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

This is exactly your use case.
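One caveat for the comparison the question describes: like PCA, TruncatedSVD produces signed components, so its output can contain negative values and MultinomialNB may still reject it with the same non-negativity error. The sketch below therefore swaps in LogisticRegression after the reduction step. That classifier is my substitution for illustration, not something the question or the answer prescribes, and the toy corpus and labels are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels (0 = travel, 1 = programming)
docs = [
    "cheap flights and hotel deals",
    "hotel booking and cheap travel",
    "travel deals for summer flights",
    "python machine learning tutorial",
    "learning python for data analysis",
    "data analysis with machine learning",
]
labels = [0, 0, 0, 1, 1, 1]

# Setup 1: raw term counts straight into Naive Bayes, as in the question
baseline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.01))

# Setup 2: counts -> TruncatedSVD (LSA) -> a classifier that accepts negative features
lsa = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)

baseline.fit(docs, labels)
lsa.fit(docs, labels)

new_docs = ["cheap hotel deals", "python data tutorial"]
baseline_pred = baseline.predict(new_docs)
lsa_pred = lsa.predict(new_docs)
print(baseline_pred, lsa_pred)
```

Wrapping each setup in a pipeline keeps the vectorizer and the reduction step fitted on the training texts only, so both variants can be scored on the same held-out pages to measure the effect of the reduction.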
