Create sparse matrix with pandas and fill it with values from one column of .dat file at indexes [x,y] from other two columns of .dat file

Question

I have a .dat file that contains three columns: userID, artistID and weight. Using Python, I read the data into a pandas DataFrame with data = pd.read_table('train.dat').

I want to create a sparse matrix (or 2D array) that uses the values from the first two columns ('userID', 'artistID') of the data DataFrame as indexes and the third column ('weight') as values. Index combinations that do not appear in the DataFrame should be NaN.

I tried creating an empty NumPy array and filling it with a for loop, but that takes a long time (train.dat has around 100k rows).

import csv
import numpy as np

f = open("train.dat", "rt")
reader = csv.reader(f, delimiter="\t")
next(reader)
data = [d for d in reader]
f.close()

data = np.array(data, dtype=float)
# userID (column 0) indexes rows, artistID (column 1) indexes columns
row = int(data[:,0].max()) + 1
col = int(data[:,1].max()) + 1

empty = np.empty((row, col))
empty[:] = np.nan

for d in data:
   empty[int(d[0]), int(d[1])] = d[2]

I also tried creating a coo_matrix and converting it to a csr_matrix (so I could access the data by index), but the indexes reset.

import scipy.sparse as sps
import pandas as pd

data = pd.read_table('train.dat')
matrix = sps.coo_matrix((data.weight, (data.index.labels[0], data.index.labels[1])))
matrix = matrix.tocsr()

Sample data:

userID    artistID  weight
    45           7      0.7114779874213837
   204         144      0.46399999999999997
    36         650      2.4232887490165225
   140         146      1.0146699266503667
   170          31      1.4124783362218372
   240         468      0.6529992406985573

Recommended answer

With your data copied to file:

In [290]: data = pd.read_csv('stack48133358.txt',delim_whitespace=True)
In [291]: data
Out[291]: 
   userID  artistID    weight
0      45         7  0.711478
1     204       144  0.464000
2      36       650  2.423289
3     140       146  1.014670
4     170        31  1.412478
5     240       468  0.652999
In [292]: M = sparse.csr_matrix((data.weight, (data.userID, data.artistID)))
In [293]: M
Out[293]: 
<241x651 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [294]: print(M)
  (36, 650)     2.42328874902
  (45, 7)       0.711477987421
  (140, 146)    1.01466992665
  (170, 31)     1.41247833622
  (204, 144)    0.464
  (240, 468)    0.652999240699
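
For readers who want to run the same construction outside an IPython session, here is a minimal self-contained sketch. It reads the sample rows from an in-memory string rather than a file; the `sample` variable is an illustrative assumption, and `sep=r"\s+"` is used in place of `delim_whitespace=True`, which recent pandas releases deprecate. Note that positions absent from the data are implicit zeros in a SciPy sparse matrix, not NaN.

```python
import io

import pandas as pd
from scipy import sparse

# Sample rows from the question, held in a string instead of train.dat
sample = """userID artistID weight
45 7 0.7114779874213837
204 144 0.46399999999999997
36 650 2.4232887490165225
140 146 1.0146699266503667
170 31 1.4124783362218372
240 468 0.6529992406985573
"""

data = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# csr_matrix((values, (row_indices, col_indices))):
# userID indexes the rows, artistID indexes the columns
M = sparse.csr_matrix(
    (
        data["weight"].to_numpy(),
        (data["userID"].to_numpy(), data["artistID"].to_numpy()),
    )
)

print(M.shape)   # inferred as (max userID + 1, max artistID + 1)
print(M.nnz)     # number of explicitly stored entries
print(M[45, 7])  # weight stored for userID 45, artistID 7
print(M[0, 0])   # a pair that is not in the data reads back as zero
```

Because the shape is inferred from the largest index on each axis, passing an explicit `shape=(n_users, n_artists)` may be preferable when the full ID range is known in advance.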

I can also load that file with genfromtxt:

In [307]: data = np.genfromtxt('stack48133358.txt',dtype=None, names=True)
In [308]: data
Out[308]: 
array([( 45,   7,  0.71147799), (204, 144,  0.464     ),
       ( 36, 650,  2.42328875), (140, 146,  1.01466993),
       (170,  31,  1.41247834), (240, 468,  0.65299924)],
      dtype=[('userID', '<i4'), ('artistID', '<i4'), ('weight', '<f8')])
In [309]: M = sparse.csr_matrix((data['weight'], (data['userID'], data['artistID'])))
In [310]: M
Out[310]: 
<241x651 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
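
The question also asks for NaN at index pairs that do not occur in the data, which a sparse matrix does not provide (unstored entries are zero). If a dense array fits in memory, one possible approach is to pre-fill it with NaN and assign all weights in a single vectorized step instead of a for loop. This is a sketch under that assumption; the array names `user_ids`, `artist_ids`, `weights` and `dense` are illustrative and hold the sample rows from the question.

```python
import numpy as np

# Index and value columns, copied from the sample rows in the question
user_ids = np.array([45, 204, 36, 140, 170, 240])
artist_ids = np.array([7, 144, 650, 146, 31, 468])
weights = np.array([
    0.7114779874213837,
    0.46399999999999997,
    2.4232887490165225,
    1.0146699266503667,
    1.4124783362218372,
    0.6529992406985573,
])

# Dense array pre-filled with NaN, large enough for the biggest index per axis
dense = np.full((user_ids.max() + 1, artist_ids.max() + 1), np.nan)

# A single fancy-indexing assignment replaces the Python-level loop
dense[user_ids, artist_ids] = weights

print(dense.shape)
print(dense[45, 7])               # weight for userID 45, artistID 7
print(np.isnan(dense[0, 0]))      # pairs not in the data stay NaN
print(np.isnan(dense).sum())      # count of cells that were never assigned
```

The dense array needs 8 bytes per cell, so for large ID ranges the sparse matrix above is likely the more practical choice.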
