Read a file as SciPy sparse matrix directly

Views: 64 Published: 2021/7/16 21:16:21 Tags: scipy sparse-matrix


Question

Is it possible to read a space-separated file, in which each line contains floating-point numbers, directly as a SciPy sparse matrix?

Recommended answer

Given: A space-separated file containing ~56 million rows, each with 25 space-separated floating-point numbers and a lot of sparsity.

Output: Convert the file into a SciPy CSR sparse matrix as fast as possible.

There may be better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).

There may also be a better solution using HDF5. This one uses a Pandas DataFrame: for 56,651,070 rows of 25 space-separated floating-point numbers each, with a lot of sparsity, it finishes in about 6.7 minutes and takes around 50 GB of RAM on a 32-core machine.

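import time

import numpy as np
import pandas as pd
import scipy.sparse as sps
import swifter  # registers the .swifter accessor on pandas objects

start_time = time.time()
input_file_name = "df"
sep = " "

# With the default comma delimiter, each space-separated row is read as one
# string; the file's first line must be the header 'array_column'.
df = pd.read_csv(input_file_name)

# Parse every row string into a float array, parallelized by swifter.
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(
    lambda x: np.fromstring(x, sep=sep)
)

# Stack the per-row arrays into one dense 2-D array, then convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))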

Output:

--- 406.22810888290405 seconds ---
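
Part of the memory use reported above probably comes from the dense intermediate array that np.stack builds before the CSR conversion. As a rough sketch of one way to avoid that (not the method used in this answer, and not benchmarked here), the file could be read in chunks, each chunk converted to CSR, and the pieces stacked at the end. This assumes a headerless file where every line holds the same number of space-separated floats; the function name read_sparse_chunks and the chunk size are made up for illustration.

```python
import io

import numpy as np
import pandas as pd
import scipy.sparse as sps


def read_sparse_chunks(source, chunksize=1_000_000):
    """Read a headerless, space-separated file of floats into one CSR matrix.

    Hypothetical helper: only one chunk is held densely at a time.
    """
    blocks = []
    reader = pd.read_csv(
        source, sep=" ", header=None, dtype=np.float64, chunksize=chunksize
    )
    for chunk in reader:
        # Convert this chunk to CSR right away so its dense copy can be freed.
        blocks.append(sps.csr_matrix(chunk.to_numpy()))
    # Stack the per-chunk matrices row-wise into a single CSR matrix.
    return sps.vstack(blocks, format="csr")


# Tiny in-memory stand-in for the real file: 4 rows, 3 columns, 3 nonzeros.
demo = io.StringIO("0 0 1.5\n0 0 0\n2 0 0\n0 3.25 0\n")
m = read_sparse_chunks(demo, chunksize=2)
print(m.shape, m.nnz)
```

For a real file, a path would be passed instead of the StringIO object. Whether this is faster than the swifter version would need to be measured; the main point is the lower peak memory.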

Matrix size:

df_np_sp_matrix

Output:

<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
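
To put the numbers in that output in perspective, here is a small back-of-the-envelope calculation of the matrix density and the approximate memory footprint in dense versus CSR form. It assumes float64 values (8 bytes) and 32-bit index arrays (4 bytes per entry); the index width is an assumption that seems reasonable because the stored-element count fits in a signed 32-bit integer, and the real footprint may differ somewhat.

```python
# Shape and stored-element count taken from the repr shown above.
n_rows, n_cols = 56_651_070, 25
nnz = 508_880_850

VALUE_BYTES = 8  # float64
INDEX_BYTES = 4  # assumed int32 indices and indptr

# Fraction of entries that are actually stored.
density = nnz / (n_rows * n_cols)

# A dense array stores every entry.
dense_bytes = n_rows * n_cols * VALUE_BYTES

# CSR stores one value and one column index per stored element,
# plus a row-pointer array of length n_rows + 1.
csr_bytes = nnz * VALUE_BYTES + nnz * INDEX_BYTES + (n_rows + 1) * INDEX_BYTES

print(f"density:   {density:.3f}")
print(f"dense:     {dense_bytes / 1e9:.2f} GB")
print(f"CSR:       {csr_bytes / 1e9:.2f} GB")
```

Under these assumptions about 36% of the entries are stored, and the CSR form takes roughly 6.3 GB against roughly 11.3 GB for the dense array.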
