How to run a random classifier in the following case


Problem Description

I am trying to experiment with a sentiment analysis case, and I am trying to run a random classifier on the following:

|Topic               |value|label|
|--------------------|-----|-----|
|Apples are great    |-0.99|0    |
|Balloon is red      |-0.98|1    |
|cars are running    |-0.93|0    |
|dear diary          |0.8  |1    |
|elephant is huge    |0.91 |1    |
|facebook is great   |0.97 |0    |

First I split it into train and test sets with the sklearn library.
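For reference, that split step might look something like this (a sketch only; the DataFrame name df and the split parameters are assumptions, not part of the question):

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table from the question; 'df' is an assumed name
df = pd.DataFrame({'Topic': ['Apples are great', 'Balloon is red', 'cars are running',
                             'dear diary', 'elephant is huge', 'facebook is great'],
                   'value': [-0.99, -0.98, -0.93, 0.8, 0.91, 0.97],
                   'label': [0, 1, 0, 1, 1, 0]})

# Hold out a test set; random_state just makes the example reproducible
train, test = train_test_split(df, test_size=0.2, random_state=42)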

Then I do the following to the Topic column so that the count vectoriser can work on it:

x = train.iloc[:, 0:2]

# keep only alphabetic characters (replace everything else with a space)
x.replace("[^a-zA-Z]", " ", regex=True, inplace=True)

# convert to lower case
x = x.apply(lambda a: a.astype(str).str.lower())

x.head(2)

After that I apply the CountVectorizer to the Topic column, combine it with the value column, and apply the random forest classifier:

## Import libraries to check accuracy
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

## implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
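# NOTE: pd.DataFrame(traindataset) below does not expand the sparse matrix into
# columns - each cell ends up holding a sparse row object, which is what makes
# randomclassifier.fit() fail (see the answer)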

train_set = pd.concat([x['compound'], pd.DataFrame(traindataset)], axis=1)

# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,train['label'])

But I get an error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-41-7a1f9b292921> in <module>()
      1 # implement RandomForest Classifier
      2 randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
----> 3 randomclassifier.fit(train_set,train['label'])

4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: setting an array element with a sequence.

My thinking is:

The value column comes from applying vader sentiment, and I want to feed it into the random forest classifier as well, to see the impact of the vader scores on the output.

Maybe there is a way to multiply the data in the value column with the sparse matrix traindataset that was generated?

Can anyone please tell me how to do that in this case?
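In code, the multiplication idea would be roughly the following (an editorial sketch, not part of the original post; it assumes the vader scores sit in a numeric column called value on the training frame):

from scipy import sparse

# Sketch: scale each bag-of-words row by that row's vader score.
# Multiplying by a sparse diagonal matrix keeps the result sparse.
scores = train['value'].astype(float).values
weighted = sparse.diags(scores).dot(traindataset)   # same shape as traindataset, still sparse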

Answer

The issue is concatenating another column onto a sparse matrix (the output of countvector.fit_transform). For simplicity's sake, let's say your training data is:

x = pd.DataFrame({'Topics':['Apples are great','Balloon is red','cars are running',
                           'dear diary','elephant is huge','facebook is great'],
                  'value':[-0.99,-0.98,-0.93,0.8,0.91,0.97,],
                  'label':[0,1,0,1,1,0]})

You can see this gives you something weird:

countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])

train_set = pd.concat([x['value'], pd.DataFrame(traindataset)], axis=1)

train_set.head(2)

    value   0
0   -0.99   (0, 0)\t1\n (0, 1)\t1
1   -0.98   (0, 3)\t1\n (0, 10)\t1

It is possible to convert the sparse matrix to a dense numpy array, and then the pandas DataFrame approach will work; however, if your dataset is huge this is extremely costly. To keep it sparse, you can do:

from scipy import sparse

# Keep everything sparse: turn the value column into an (n_samples, 1) sparse
# column and horizontally stack it with the bag-of-words matrix.
train_set = sparse.hstack([sparse.csr_matrix(x['value'].values.reshape(-1, 1)), traindataset])

randomclassifier = RandomForestClassifier(n_estimators=200, criterion='entropy')
randomclassifier.fit(train_set, x['label'])
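For comparison, the dense route mentioned above would look roughly like this (a sketch; fine for this toy table, but memory-hungry once the vocabulary grows — the bow_ column prefix is just an assumed naming choice):

# Dense alternative (sketch): materialise the bag-of-words matrix and
# concatenate it with the value column as ordinary DataFrame columns.
dense_bow = pd.DataFrame(traindataset.toarray()).add_prefix('bow_')   # string column names keep sklearn happy
train_set_dense = pd.concat([x['value'], dense_bow], axis=1)

randomclassifier.fit(train_set_dense, x['label'])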

Also take a look at the scipy.sparse help pages.
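To close the loop with the metrics imported in the question, the held-out rows from the original train/test split can be scored the same way. A sketch, assuming a test DataFrame with the same Topics, value and label columns and the fitted countvector and randomclassifier from above:

# Transform the test topics with the already-fitted vectoriser (transform, not fit_transform)
testdataset = countvector.transform(test['Topics'])

# Build the test features exactly like the training features
test_set = sparse.hstack([sparse.csr_matrix(test['value'].values.reshape(-1, 1)), testdataset])

predictions = randomclassifier.predict(test_set)
print(confusion_matrix(test['label'], predictions))
print(classification_report(test['label'], predictions))
print(accuracy_score(test['label'], predictions))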
