hstack csr matrix with pandas array


Question

I am working through an exercise on Amazon Reviews; the code is below. Basically, I am not able to add a column (a pandas Series) to the CSR matrix that I get after applying bag-of-words (BoW). Even though the number of rows in both objects matches, I cannot get the call to work.

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE

#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')

filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
       return 'negative'
    return 'positive'

actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape

display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)

sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)

final.shape

display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)

final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

final['Score'].value_counts()

count_vect = CountVectorizer()

final_counts = count_vect.fit_transform(final['Text'].values)

final_counts.shape

type(final_counts)

positive_negative = final['Score']

#Below is giving error
final_counts = hstack((final_counts,positive_negative))

Recommended answer

sparse.hstack converts each of its inputs to coo format and combines those coo matrices into a new coo format matrix.

final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.

positive_negative is a column of a DataFrame (a pandas Series). Look at the result of

 sparse.coo_matrix(positive_negative)

It is probably a (1,n) sparse matrix. But to be combined with final_counts it needs to be (n,1) shaped, with one row for each row of final_counts.
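The shape can be checked directly. The snippet below is only a sketch: labels is a hypothetical numeric stand-in for the DataFrame column (the real positive_negative holds strings), and n is 4 here.

```python
import pandas as pd
from scipy import sparse

# Hypothetical numeric stand-in for a DataFrame column with n = 4 rows.
labels = pd.Series([1, 0, 1, 1])

# A 1-D input is treated as a single row, so the result has one row and n columns.
row_matrix = sparse.coo_matrix(labels)
row_shape = row_matrix.shape

# Number of rows that final_counts would have in this toy case.
n_rows_needed = len(labels)

print(row_shape, n_rows_needed)
```

Stacking horizontally requires every block to have the same number of rows, so a block with 1 row cannot sit next to one with n rows.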

Try creating the sparse matrix and transposing it:

sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
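For reference, here is a small self-contained sketch of that call. It assumes `from scipy import sparse`, which the question's imports do not include. The toy final_counts and positive_negative values are hypothetical stand-ins for the real data. Because the question maps Score to the strings 'positive' and 'negative', and scipy sparse matrices hold numeric data, the sketch also assumes the labels are first encoded as 1 and 0; that encoding is an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the bag-of-words output: 3 documents, 4 terms.
final_counts = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                           [0, 1, 0, 0],
                                           [3, 0, 0, 1]]))

# Hypothetical stand-in for final['Score'] after it was mapped to strings.
positive_negative = pd.Series(['positive', 'negative', 'positive'])

# Assumed numeric encoding: positive -> 1, negative -> 0.
numeric_labels = positive_negative.map({'positive': 1, 'negative': 0})

# Build the (1, n) sparse matrix, then transpose it to (n, 1).
label_column = sparse.coo_matrix(numeric_labels).T

# hstack returns coo format by default; convert back to csr for row slicing.
combined = sparse.hstack((final_counts, label_column)).tocsr()

combined_shape = combined.shape
last_column = combined[:, -1].toarray().ravel().tolist()
print(combined_shape, last_column)
```

The result has one more column than final_counts, and that last column carries the encoded label for each row.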
