Problem with CountVectorizer from scikit-learn package


Problem Description


I have a dataset of movie reviews. It has two columns: 'class' and 'reviews'. I have done most of the routine preprocessing, such as lowercasing the text, removing stop words, and removing punctuation. At the end of preprocessing, each review is a string of space-separated words.
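For reference, the preprocessing described above could be sketched like this (a minimal sketch; the actual stop-word list and cleaning steps in my pipeline may differ):

```python
import string

# Hypothetical stop-word list for illustration; a real pipeline
# would use a full list such as NLTK's English stop words
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}

def preprocess(review):
    """Lowercase, strip punctuation, and drop stop words."""
    review = review.lower()
    review = review.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in review.split() if w not in STOP_WORDS)

print(preprocess("The Da Vinci Code is an awesome book!"))
# da vinci code awesome book
```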

I want to use CountVectorizer and then TF-IDF to create features from my dataset, so I can do classification/text recognition with a Random Forest. I looked at a few websites and tried to follow what they did. This is my code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X)
tfidfconverter = TfidfTransformer()
X1 = tfidfconverter.fit_transform(new_X)
print(X1)

But, i get this output...

(0, 0)  1.0

which doesn't make sense at all. I fiddled with some parameters and commented out the TF-IDF part. Here's my code:

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer(analyzer = 'char_wb',
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

new_X = vectorizer.fit_transform(X)
print(new_X)

and this is my output:

(0, 4)  1
(0, 6)  1
(0, 2)  1
(0, 5)  1
(0, 1)  2
(0, 3)  1
(0, 0)  2

Am I missing something, or am I too much of a newbie to understand? What I understood and wanted is that after the transform I would receive a new dataset with many features (the words and their frequencies) plus the label column. But what I am getting is far from it.

To repeat: all I want is a new numeric dataset built from my reviews, with words as features, so that Random Forest or another classification algorithm can work with it.

Thanks.

Btw, these are the first five rows of my dataset:

   class                                            reviews
0      1                         da vinci code book awesome
1      1  first clive cussler ever read even books like ...
2      1                            liked da vinci code lot
3      1                            liked da vinci code lot
4      1            liked da vinci code ultimatly seem hold

Solution

Suppose you happen to have a dataframe:

data
    class   reviews
0   1   da vinci code book aw...
1   1   first clive cussler ever read even books lik...
2   1   liked da vinci cod...
3   1   liked da vinci cod...
4   1   liked da vinci code ultimatly seem...

Separate into features and outcomes:

y = data['class']
X = data.drop('class', axis = 1)

Then, following your pipeline, you can prepare your data for any ML algo like this:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X.reviews)  # fit on the text column itself, not the whole dataframe
new_X
<5x18 sparse matrix of type '<class 'numpy.int64'>'
	with 30 stored elements in Compressed Sparse Row format>

This new_X can be used in your further pipeline "as is" or converted to a dense matrix:

new_X.todense()
matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]],
       dtype=int64)

Rows in this matrix correspond to rows of the original reviews column, and the columns hold word counts. If you're interested in which column refers to which word, look at:

vectorizer.vocabulary_
{'da': 6,
 'vinci': 17,
 'code': 4,
 'book': 1,
 'awesome': 0,
 'first': 9,
 'clive': 3,
 'cussler': 5,
....

where each key is a word and each value is its column index in the matrix above (you may notice that the column indices follow the alphabetically ordered vocabulary, with 'awesome' in column 0, and so on).

You may further proceed with your pipeline like this:

tfidfconverter = TfidfTransformer()  
X1 = tfidfconverter.fit_transform(new_X)
X1
<5x18 sparse matrix of type '<class 'numpy.float64'>'
    with 30 stored elements in Compressed Sparse Row format>
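As a side note, `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` into a single step; a minimal sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["da vinci code book awesome",
        "liked da vinci code lot"]

# Equivalent to CountVectorizer followed by TfidfTransformer
tfidf = TfidfVectorizer()
X1 = tfidf.fit_transform(docs)
print(X1.shape)  # (2, 7): two documents, seven vocabulary words
```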

Finally, you can feed your preprocessed data into RandomForest:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X1, y)

This code runs without errors on my notebook. Please let us know if this solves your problem!
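The whole flow can also be expressed as one scikit-learn `Pipeline`. A sketch on a tiny made-up dataset (your real reviews and labels would come from the CSV):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

reviews = ["da vinci code book awesome",
           "liked da vinci code lot",
           "boring book hated it",
           "terrible plot awful book"]
labels = [1, 1, 0, 0]

# Each step's output feeds the next; fit() trains the whole chain at once
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipeline.fit(reviews, labels)
print(pipeline.predict(["liked da vinci code"]))
```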
