hstack csr matrix with pandas array
Problem description
I am doing an exercise on Amazon Reviews; the code is below. Basically, I am not able to add a column (a pandas array) to the CSR matrix that I got after applying BoW. Even though the number of rows in both matrices matches, I cannot get it to work.
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE
#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')
filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape
display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final['Score'].value_counts()
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)
final_counts.shape
type(final_counts)
positive_negative = final['Score']
#Below is giving error
final_counts = hstack((final_counts,positive_negative))
Recommended answer
sparse.hstack combines the coo format matrices of the inputs into a new coo format matrix.
final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.
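As a quick check of that conversion, here is a minimal sketch. It uses a small hand-made matrix as a stand-in for final_counts (the values are made up for illustration). It also shows the format argument of sparse.hstack, in case csr output is preferred over the default.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the bag-of-words matrix: 3 documents, 4 terms.
final_counts = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                           [0, 3, 0, 0],
                                           [0, 0, 0, 4]]))

# csr -> coo keeps the shape and the stored values.
as_coo = sparse.coo_matrix(final_counts)
print(as_coo.format, as_coo.shape)    # coo (3, 4)

# hstack takes a format argument when a specific output format is wanted.
stacked = sparse.hstack((final_counts, final_counts), format="csr")
print(stacked.format, stacked.shape)  # csr (3, 8)
```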
positive_negative is a column of a DataFrame. Look at
sparse.coo_matrix(positive_negative)
It probably is a (1,n) sparse matrix, where n is the number of rows in final_counts. But to combine it with final_counts it needs to be (n,1) shaped.
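The shape can be checked directly. The sketch below assumes a small numeric Series as a stand-in for positive_negative (the name labels and its values are hypothetical) and only prints what sparse.coo_matrix makes of it.

```python
import pandas as pd
from scipy import sparse

# Hypothetical numeric labels standing in for positive_negative.
labels = pd.Series([1, 0, 1])

# A 1-D input is treated as a single row by coo_matrix.
row = sparse.coo_matrix(labels)
print(row.shape)    # (1, 3)

# final_counts would have len(labels) rows, so the extra column
# has to have this many rows as well.
print(len(labels))  # 3
```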
Try creating the sparse matrix and transposing it:
sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
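One further point: in the question's code the Score column holds the strings 'positive' and 'negative', and a sparse matrix needs numeric data. The sketch below therefore assumes the labels are first mapped to 1 and 0. That mapping, the name score_numeric, and the toy data are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins: a 3 x 4 count matrix and the matching Score column.
final_counts = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                           [0, 3, 0, 0],
                                           [0, 0, 0, 4]]))
positive_negative = pd.Series(["positive", "negative", "positive"])

# Assumption: encode the string labels as numbers before building a sparse matrix.
score_numeric = positive_negative.map({"negative": 0, "positive": 1})

# (1, n) sparse row, transposed into an (n, 1) column.
score_column = sparse.coo_matrix(score_numeric).T

# Row counts now match, so the column can be appended on the right.
combined = sparse.hstack((final_counts, score_column), format="csr")
print(score_column.shape, combined.shape)  # (3, 1) (3, 5)
```

Passing format="csr" is optional; without it the result comes back in whatever format hstack chooses by default.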