Efficiently constructing sparse biadjacency matrix in Numpy

Views: 116 | Posted: 2018/5/25 17:46:00 | Tags: python performance numpy graph

Question

I'm trying to load this CSV file into a sparse numpy matrix, which would represent the biadjacency matrix of this user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101

Here's a sample:

603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww

There are 876,961 lines (one per user) and 15,122 subreddits and a total of 8,495,597 user-to-subreddit associations.
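
For a sense of scale, here is a quick back-of-the-envelope check using only the three figures above. The variable names are made up for illustration, and the 8-bytes-per-cell figure assumes a dense float64 array.

```python
# Figures quoted in the question
n_users = 876_961
n_subreddits = 15_122
n_links = 8_495_597

# Number of cells a dense subreddit-by-user matrix would need
dense_cells = n_users * n_subreddits

# Fraction of cells that are actually nonzero
density = n_links / dense_cells

# Memory for a dense float64 matrix (assumption: 8 bytes per cell)
dense_gib = dense_cells * 8 / 2**30

print(f"dense cells: {dense_cells:,}")
print(f"density:     {density:.4%}")
print(f"dense size:  {dense_gib:.1f} GiB")
```

Well under 0.1% of the cells are nonzero, and a dense array of that shape would run to tens of GiB, so a sparse format is the only practical choice here.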

Here's the code which I have right now, and which takes 20 minutes to run on my MacBook Pro:

import numpy as np
from scipy.sparse import csr_matrix 

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        entry_count += len(reddits)
        for r in reddits: all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data =  np.ones((entry_count,))
i=0
user_idx = 0
for row in row_list:
    for reddit_idx in np.nonzero(np.in1d(reddits_list,row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i+=1
    user_idx+=1
adj = csr_matrix( (data,(rows,cols)), shape=(len(reddits_list), len(row_list)) )

It seems hard to believe that this is as fast as this can go... Loading the 82MB file into a list of lists takes 5s but building out the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format that I can convert this CSV into in less than 20min that would import more quickly? Is there some obviously-expensive operation I'm doing here that's not good? I've tried building a dense matrix and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time, and that's no faster.

Recommended answer

Couldn't sleep, tried one last thing... I was able to get it down to 10 seconds this way, in the end:

import numpy as np
from scipy.sparse import csr_matrix

user_ids = []
subreddit_ids = []
subreddits = {}  # subreddit name -> row index
i = 0            # current user (column) index
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]:
            if sr not in subreddits:
                subreddits[sr] = len(subreddits)
            user_ids.append(i)
            subreddit_ids.append(subreddits[sr])
        i += 1

adj = csr_matrix(
    (np.ones((len(user_ids),)), (np.array(subreddit_ids), np.array(user_ids))),
    shape=(len(subreddits), i))
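
The likely reason this is so much faster: the original loop called np.in1d against the full array of subreddit names once per user, while this version does one dictionary lookup per association. Below is a minimal sketch of the same idea wrapped in a function so it can be checked on a few in-memory lines. The name `build_biadjacency`, the `int8` data type, and the shortened sample rows are my own choices for illustration, not part of the answer above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_biadjacency(lines):
    """Return (adj, subreddits) for an iterable of 'user,sub1,sub2,...' lines.

    adj has one row per subreddit and one column per input line (user).
    subreddits maps each subreddit name to its row index.
    """
    user_ids = []
    subreddit_ids = []
    subreddits = {}  # subreddit name -> row index, assigned on first sight
    n_users = 0
    for line in lines:
        for sr in line.rstrip().split(",")[1:]:
            # setdefault returns the existing index or stores the next free one
            subreddit_ids.append(subreddits.setdefault(sr, len(subreddits)))
            user_ids.append(n_users)
        n_users += 1
    # Assumption: int8 is enough because every entry is 0 or 1
    data = np.ones(len(user_ids), dtype=np.int8)
    rows = np.array(subreddit_ids, dtype=np.int64)
    cols = np.array(user_ids, dtype=np.int64)
    adj = csr_matrix((data, (rows, cols)), shape=(len(subreddits), n_users))
    return adj, subreddits


# Rows abbreviated from the sample in the question
sample = [
    "603,politics,trees,pics",
    "604,Metal,AskReddit,pics",
    "605,politics,AskReddit,WTF",
]
adj, subreddits = build_biadjacency(sample)
print(adj.shape, adj.nnz)
print(adj.toarray())
```

For the real file, passing the open file object as `lines` should work the same way. Note that if a line lists the same subreddit twice, the csr_matrix constructor sums the duplicate entries, so that cell would hold 2 rather than 1. On the file format part of the question: once the matrix is built, scipy.sparse.save_npz and scipy.sparse.load_npz can store and reload it, so the CSV would only need to be parsed once.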
