How to efficiently create a sparse vector in Python?

Question

I have a dictionary of keys where each value should be a huge sparse vector (~700,000 elements, maybe more). How do I efficiently grow or build this data structure? Right now my implementation works only for smaller sizes.

from collections import defaultdict

myvec = defaultdict(list)
for id in id_data:
    for item in item_data:
        if item in item_data[id]:
            myvec[id].append(item * 0.5)
        else:
            myvec[id].append(0)

The above code, when used with huge files, quickly eats up all the available memory. I tried removing the myvec[id].append(0) branch and storing only the non-zero values, since the length of each myvec[id] list is constant. That worked on my huge test file with decent memory consumption, but I'd rather find a better way to do it.

I know that there are different types of sparse arrays/matrices for this purpose, but I have no intuition as to which one is better. I tried to use lil_matrix (from scipy.sparse) instead of the myvec dict, but it turned out to be much slower than the above code.

So the problem basically boils down to the following two questions:

  1. Is it possible to create a sparse data structure on the fly in Python?
  2. How can one create such a sparse data structure with decent speed?

Recommended answer

Appending to a list (or lists) will always be faster than appending to a numpy.array or to a sparse matrix (which stores its data in several numpy arrays). lil is supposed to be the fastest format when you have to grow the matrix incrementally, but it will still be slower than working directly with lists.

Numpy arrays have a fixed size, so the np.append function actually creates a new array by concatenating the old data with the new data.
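The copy-on-append behaviour described above can be checked with a small sketch. The array contents here are made up for illustration; only np.append and np.shares_memory are used from numpy.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)  # returns a new array; `a` itself is left untouched

print(a.shape, b.shape)        # (3,) (4,)
print(np.shares_memory(a, b))  # False: b is a fresh copy

# A Python list, by contrast, grows in place.
lst = [1, 2, 3]
lst.append(4)
```

Because each np.append call copies the data accumulated so far, calling it once per element in a loop does quadratic work overall, which is consistent with the advice to collect values in lists first.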

Your example code would be more useful if you gave us some data, so we could cut, paste, and run it.

For simplicity, let's define:

data_dict=dict(one=[1,0,2,3,0,0,4,5,0,0,6])

A sparse matrix can be created directly from this with:

from scipy import sparse

sparse.coo_matrix(data_dict['one'])

whose attributes are:

data:  array([1, 2, 3, 4, 5, 6])
row:   array([0, 0, 0, 0, 0, 0], dtype=int32)
col:   array([ 0,  2,  3,  6,  7, 10], dtype=int32)

sparse.lil_matrix(data_dict['one'])
data: array([[1, 2, 3, 4, 5, 6]], dtype=object)
rows: array([[0, 2, 3, 6, 7, 10]], dtype=object)

In timings, the coo version is a lot faster.

A sparse matrix saves only the nonzero data, but it also has to save an index for each value. There is also a dictionary format (dok), which uses a tuple (row, col) as the key.
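As a rough illustration of the dictionary format mentioned above, the following sketch fills a scipy.sparse.dok_matrix one nonzero entry at a time. It reuses the data_dict sample from this answer; the shape (1, 11) is simply the length of that sample.

```python
from scipy import sparse

data_dict = dict(one=[1, 0, 2, 3, 0, 0, 4, 5, 0, 0, 6])

# One row, eleven columns; entries are stored under (row, col) keys.
dok = sparse.dok_matrix((1, 11), dtype=int)
for col, value in enumerate(data_dict['one']):
    if value != 0:
        dok[0, col] = value  # only nonzero values are ever stored

print(dok.nnz)      # number of stored entries
print(dok[0, 2])    # look up a single element by (row, col)
dense = dok.toarray()  # back to a dense 1 x 11 array for inspection
```

Whether dok is quicker than lil or plain lists for a given workload would need to be timed; this only shows how the format is addressed.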

An example of incremental construction is:

llm = sparse.lil_matrix((1,11),dtype=int)
for i in range(11):
    llm[0,i]=data_dict['one'][i]

For this small case, this incremental approach is faster.

I get even better speed by adding only the nonzero terms to the sparse matrix:

llm = sparse.lil_matrix((1,11),dtype=int)
for i in range(11):
    if data_dict['one'][i]!=0:
        llm[0,i]=data_dict['one'][i]

I can imagine adapting this to your defaultdict example. Instead of calling myvec[id].append(0), you keep a record of where you appended the item * 0.5 values (whether in a separate list or via a lil_matrix). It would take some experimenting to adapt this idea to a default dictionary.
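One possible way to adapt the idea to a defaultdict is sketched below. The question does not show its real inputs, so all_items, item_data, vec_data, and vec_cols are hypothetical names with toy values, and the assumption is that each id maps to the set of items present for it.

```python
from collections import defaultdict

# Hypothetical toy inputs; the question does not show the real data.
all_items = [2, 4, 6, 8, 10]                # every possible item, one per column
item_data = {'a': {2, 8}, 'b': {4, 6, 10}}  # items actually present for each id

vec_data = defaultdict(list)  # nonzero values for each id
vec_cols = defaultdict(list)  # column index of each stored value

for id_, present in item_data.items():
    for col, item in enumerate(all_items):
        if item in present:
            vec_data[id_].append(item * 0.5)
            vec_cols[id_].append(col)
        # No else branch: a zero is implied wherever a column index is absent.

print(dict(vec_data))
print(dict(vec_cols))
```

Memory then scales with the number of nonzero entries rather than with the full vector length, at the cost of one extra integer per stored value.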

So basically the goal is to create two lists:

data = [1, 2, 3, 4, 5, 6]
cols = [ 0,  2,  3,  6,  7, 10]

Whether or not you create a sparse matrix from these depends on what else you need to do with the data.
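If a sparse matrix is wanted in the end, the two lists can be handed to scipy.sparse.coo_matrix in one call. This is a minimal sketch using the data and cols values above; the rows list and the shape (1, 11) are assumptions matching the single-row sample.

```python
from scipy import sparse

data = [1, 2, 3, 4, 5, 6]
cols = [0, 2, 3, 6, 7, 10]
rows = [0] * len(data)  # a single-row vector, so every row index is 0

# Build in coo form from (data, (row, col)) triplets, then convert to csr,
# which is generally better suited to arithmetic and row slicing.
vec = sparse.coo_matrix((data, (rows, cols)), shape=(1, 11)).tocsr()

print(vec.nnz)        # 6
print(vec.toarray())  # [[1 0 2 3 0 0 4 5 0 0 6]]
```

For many ids, the same pattern extends by appending each id's row number to rows and concatenating the per-id lists before a single coo_matrix call.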
