Create sparse matrix with pandas and fill it with values from one column of .dat file at indexes [x,y] from other two columns of .dat file
Question
I have a .dat file that contains three columns: userID, artistID and weight. Using Python, I read the data into a pandas DataFrame with data = pd.read_table('train.dat').
I want to create a sparse matrix (or 2D array) that uses the values from the first two columns ('userID', 'artistID') of the data DataFrame as indexes and the third column ('weight') as the value. Index combinations not present in the DataFrame should be NaN.
I tried creating an empty numpy array and filling it with a for loop, but that takes a lot of time (there are around 100k rows in train.dat).
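import csv
import numpy as np

# Read the tab-separated file, skipping the header row
with open("train.dat", "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)
    data = [d for d in reader]
data = np.array(data, dtype=float)

# Rows are indexed by userID (column 0), columns by artistID (column 1)
row = int(data[:, 0].max()) + 1
col = int(data[:, 1].max()) + 1
empty = np.empty((row, col))
empty[:] = np.nan

# Fill one cell per input row (this loop is the slow part)
for d in data:
    empty[int(d[0]), int(d[1])] = d[2]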
I also tried creating a coo_matrix and converting it to a csr_matrix (so I could access data by index), but the indexes reset.
import scipy.sparse as sps
import pandas as pd
data = pd.read_table('train.dat')
matrix = sps.coo_matrix((data.weight, (data.index.labels[0], data.index.labels[1])))
matrix = matrix.tocsr()
Sample data:
userID artistID weight
45 7 0.7114779874213837
204 144 0.46399999999999997
36 650 2.4232887490165225
140 146 1.0146699266503667
170 31 1.4124783362218372
240 468 0.6529992406985573
Recommended answer
With your data copied to a file:
In [290]: data = pd.read_csv('stack48133358.txt',delim_whitespace=True)
In [291]: data
Out[291]:
userID artistID weight
0 45 7 0.711478
1 204 144 0.464000
2 36 650 2.423289
3 140 146 1.014670
4 170 31 1.412478
5 240 468 0.652999
In [292]: M = sparse.csr_matrix((data.weight, (data.userID, data.artistID)))
In [293]: M
Out[293]:
<241x651 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [294]: print(M)
(36, 650) 2.42328874902
(45, 7) 0.711477987421
(140, 146) 1.01466992665
(170, 31) 1.41247833622
(204, 144) 0.464
(240, 468) 0.652999240699
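The session above leaves out its imports, so here is a self-contained sketch of the same construction. It assumes `sparse` refers to `scipy.sparse`, inlines the six sample rows instead of reading a file, and uses `sep=r"\s+"` in place of `delim_whitespace=True`; those choices are mine, not part of the original session.

```python
import io

import pandas as pd
from scipy import sparse

# Sample rows from the question, inlined so the sketch runs without a data file.
SAMPLE = """userID artistID weight
45 7 0.7114779874213837
204 144 0.46399999999999997
36 650 2.4232887490165225
140 146 1.0146699266503667
170 31 1.4124783362218372
240 468 0.6529992406985573
"""

# Split columns on any run of whitespace.
data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# The constructor takes (values, (row_indices, col_indices));
# the shape is inferred from the largest index in each dimension.
M = sparse.csr_matrix(
    (
        data["weight"].to_numpy(),
        (data["userID"].to_numpy(), data["artistID"].to_numpy()),
    )
)

print(M.shape)      # (241, 651)
print(M.nnz)        # 6
print(M[36, 650])   # stored weight for user 36, artist 650
print(M[0, 0])      # 0.0
```

Note that an entry that was never stored reads back as 0.0, not NaN, so this form fits best when a missing weight can be treated as zero.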
I can also load that file with genfromtxt:
In [307]: data = np.genfromtxt('stack48133358.txt',dtype=None, names=True)
In [308]: data
Out[308]:
array([( 45, 7, 0.71147799), (204, 144, 0.464 ),
( 36, 650, 2.42328875), (140, 146, 1.01466993),
(170, 31, 1.41247834), (240, 468, 0.65299924)],
dtype=[('userID', '<i4'), ('artistID', '<i4'), ('weight', '<f8')])
In [309]: M = sparse.csr_matrix((data['weight'], (data['userID'], data['artistID'])))
In [310]: M
Out[310]:
<241x651 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
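The question also asked for NaN at index pairs that do not appear in the data, which a sparse matrix does not give (unstored entries read as zero). If a dense array is acceptable, one possible approach, not part of the original answer, is to replace the for loop with a single NumPy fancy-indexing assignment. The sketch below inlines the sample values; the variable names are illustrative only.

```python
import numpy as np

# Sample values from the question (hypothetical variable names).
user_ids = np.array([45, 204, 36, 140, 170, 240])
artist_ids = np.array([7, 144, 650, 146, 31, 468])
weights = np.array([
    0.7114779874213837,
    0.46399999999999997,
    2.4232887490165225,
    1.0146699266503667,
    1.4124783362218372,
    0.6529992406985573,
])

# Start from an all-NaN array sized by the largest index in each dimension.
dense = np.full((user_ids.max() + 1, artist_ids.max() + 1), np.nan)

# One vectorized assignment instead of a Python-level loop over rows.
dense[user_ids, artist_ids] = weights

print(dense.shape)             # (241, 651)
print(dense[45, 7])            # 0.7114779874213837
print(np.isnan(dense[0, 0]))   # True
```

The trade-off is memory: the dense array allocates every cell, so it only suits ID ranges small enough to fit. The two approaches also differ on repeated (userID, artistID) pairs: the csr_matrix constructor sums duplicates, while fancy-index assignment keeps just one of the values.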