Read a file as SciPy sparse matrix directly


Question

Is it possible to read a space-separated file, in which each line contains floating point numbers, directly as a SciPy sparse matrix?

Recommended answer

Given: A space-separated file containing ~56 million rows, each with 25 space-separated floating point numbers, many of which are zero.

Output: Convert the file into a SciPy CSR sparse matrix as fast as possible.
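For readers less familiar with the target format, here is a minimal sketch of how a few space-separated lines map onto the three arrays of a CSR matrix. The sample lines are made up for illustration and are not taken from the original file.

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical sample lines, standing in for rows of the real file.
lines = ["0 0 1.5 0", "0 0 0 0", "2.0 0 0 3.0"]

# Parse every line into a list of floats, then build a dense 2-D array.
dense = np.array([[float(tok) for tok in line.split()] for line in lines])

# Converting to CSR keeps only the non-zero entries.
csr = sps.csr_matrix(dense)

print(csr.data)     # the non-zero values, row by row
print(csr.indices)  # the column index of each stored value
print(csr.indptr)   # where each row starts and ends inside data/indices
```

CSR stores one value and one column index per non-zero entry, plus one row pointer per row, which is why it pays off when many entries are zero.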

There may be better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).

There may also be a better solution using HDF5. This solution uses a Pandas dataframe: for 56,651,070 rows of 25 space-separated floating point numbers each, with a lot of sparsity, it finishes in 6.7 minutes and takes around 50 GB of RAM on a 32-core machine.

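import time

import numpy as np
import pandas as pd
import scipy.sparse as sps
import swifter  # registers the .swifter accessor on pandas objects

start_time = time.time()
input_file_name = "df"
sep = " "
# Each line is read as one string in a single column. This assumes the file
# starts with a header line naming that column "array_column".
df = pd.read_csv(input_file_name)
# Parse every line into a float array; swifter parallelizes the apply.
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(lambda x: np.fromstring(x, sep=sep))
# Stack the per-row arrays into one dense 2-D array, then convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))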

Output:

--- 406.22810888290405 seconds ---
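If holding the whole dense array in memory is a concern, one possible alternative is to convert the file chunk by chunk and stack the sparse pieces. The sketch below is not the method used in this answer and has not been benchmarked against it. The function name `read_sparse_in_chunks`, the chunk size and the in-memory sample are hypothetical, and the sketch assumes a file without a header line and without trailing separators.

```python
import io

import numpy as np
import pandas as pd
import scipy.sparse as sps


def read_sparse_in_chunks(source, sep=" ", chunksize=100_000):
    """Read a separator-delimited file of floats into a CSR matrix, chunk by chunk."""
    blocks = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    reader = pd.read_csv(source, sep=sep, header=None,
                         dtype=np.float64, chunksize=chunksize)
    for chunk in reader:
        # Only one chunk is dense at a time; the CSR block keeps its non-zeros.
        blocks.append(sps.csr_matrix(chunk.to_numpy()))
    # Stack the per-chunk blocks vertically into a single CSR matrix.
    return sps.vstack(blocks, format="csr")


# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO("0 0 1.5\n0 0 0\n2.0 0 0\n0 3.25 0\n")
matrix = read_sparse_in_chunks(sample, chunksize=2)
print(matrix.shape, matrix.nnz)
```

The trade-off is that peak memory is bounded by one dense chunk plus the accumulated sparse blocks, at the cost of parsing in a single process unless the chunks are distributed separately.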

Matrix size:

df_np_sp_matrix

Output:

<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
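The numbers in this output allow a rough estimate of how sparse the matrix is and how much memory the CSR form needs. The short calculation below uses only the shape and stored-element count shown above. It assumes 8-byte float64 values and 4-byte int32 index arrays, which is an assumption about the index dtype and not something the output states.

```python
# Shape and stored-element count taken from the output above.
n_rows, n_cols = 56_651_070, 25
nnz = 508_880_850

total_cells = n_rows * n_cols
density = nnz / total_cells  # fraction of entries that are stored (non-zero)

# Dense float64 array: 8 bytes per cell.
dense_bytes = total_cells * 8

# CSR: values (8 bytes each) + column indices (assumed int32, 4 bytes each)
# + row pointers (assumed int32, one per row plus one).
csr_bytes = nnz * 8 + nnz * 4 + (n_rows + 1) * 4

print(f"density: {density:.1%}")
print(f"dense:   {dense_bytes / 1e9:.2f} GB")
print(f"CSR:     {csr_bytes / 1e9:.2f} GB")
```

Under these assumptions roughly 36% of the entries are stored, and the CSR matrix takes about 6.3 GB against about 11.3 GB for the equivalent dense array.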
