Read a file as SciPy sparse matrix directly
Question
Is it possible to read a space-separated file, with each line containing floating-point numbers, directly as a SciPy sparse matrix?
Recommended answer
Given: A space-separated file containing ~56 million rows, with 25 space-separated floating-point numbers in each row and a lot of sparsity.
Output: Convert the file into a SciPy CSR sparse matrix as fast as possible.
There may be better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).
There may also be a better solution using HDF5. This one uses a Pandas DataFrame: it finishes in 6.7 minutes and takes around 50 GB of RAM on a 32-core machine, for 56,651,070 rows of 25 space-separated floating-point numbers each, with a lot of sparsity.
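import time

import numpy as np
import pandas as pd
import scipy.sparse as sps
import swifter  # registers the .swifter accessor on pandas objects

start_time = time.time()
input_file_name = "df"
sep = " "
# The file contains no commas, so each line is read as one string in a single column.
# Assumption: the file has no header line, so the column is named explicitly.
df = pd.read_csv(input_file_name, header=None, names=["array_column"])
# Parse each line into a float array. Series.apply takes no axis argument, so axis=1 is dropped.
df["array_column"] = df["array_column"].swifter.allow_dask_on_strings().apply(lambda x: np.fromstring(x, sep=sep))
# Stack the per-row arrays into one dense 2-D array, then convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df["array_column"].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))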
Output:
--- 406.22810888290405 seconds ---
Matrix size:
df_np_sp_matrix
Output:
<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
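The solution above builds the full dense array with np.stack before converting it, which is probably where much of the ~50 GB goes. One possible variation, which is not from the original answer and has not been benchmarked against it, is to read the file in chunks and convert each chunk to CSR before stacking. The sketch below assumes the file has no header line and no trailing spaces on any line; the function name read_sparse_in_chunks, the chunk_rows parameter, and the sample data are made up for illustration.

```python
import io

import numpy as np
import pandas as pd
import scipy.sparse as sps


def read_sparse_in_chunks(path_or_buffer, chunk_rows=100_000):
    """Read a space-separated float file into a CSR matrix, chunk by chunk."""
    blocks = []
    reader = pd.read_csv(
        path_or_buffer,
        sep=" ",
        header=None,        # assumption: the file has no header line
        dtype=np.float64,
        chunksize=chunk_rows,
    )
    for chunk in reader:
        # Each chunk is dense but small; converting it right away keeps
        # only its non-zero entries in memory.
        blocks.append(sps.csr_matrix(chunk.to_numpy()))
    # Stack the per-chunk CSR blocks vertically into one CSR matrix.
    return sps.vstack(blocks, format="csr")


# Tiny in-memory stand-in for the real file (4 rows, 3 columns).
sample = io.StringIO("0 0 1.5\n0 0 0\n2.0 0 0\n0 3.25 0\n")
matrix = read_sparse_in_chunks(sample, chunk_rows=2)
print(matrix.shape, matrix.nnz)  # (4, 3) 3
```

For the real file, a path such as "df" would be passed instead of the StringIO buffer. Whether this is faster than the swifter version depends on the machine and the data; the main expected benefit is a lower peak memory footprint.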