How can I "sparsify" on two values?


Problem description

Consider the pandas Series s:

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)

s.memory_usage()

8080

I can use to_sparse:

s.to_sparse(fill_value=0).memory_usage()

4080

But I only have 2 types of integers. I'd think I could sparsify twice. Is there a way to do this?
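The "sparsify twice" idea amounts to run-length encoding: with only two values that occur in long runs, the whole series can be recovered from the run boundaries plus the value of each run. A minimal numpy sketch of that idea (not pandas' actual storage scheme):

```python
import numpy as np

# A two-valued array: n zeros followed by n ones, mirroring the question's Series
n = 1000
a = np.array([0] * n + [1] * n)

# Positions where the value switches, plus the two ends, define the runs
change_points = np.flatnonzero(np.diff(a)) + 1
boundaries = np.concatenate(([0], change_points, [a.size]))
run_values = a[boundaries[:-1]]          # value at the start of each run

# The full 2000-element array is recoverable from 3 boundaries + 2 values
reconstructed = np.repeat(run_values, np.diff(boundaries))
assert (reconstructed == a).all()
print(list(boundaries), list(run_values))   # [0, 1000, 2000] [0, 1]
```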

Answer

Since you tagged this with scipy, I'll show you what a scipy.sparse matrix is like:

In [31]: n=100
In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]: 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]: 
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
       165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
       178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
       191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([  0, 100], dtype=int32)

It has replaced the n elements of arr with 2 arrays, each with n/2 elements. Even if I replace the int with uint8, the M.indices array will still be int32.
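A quick check of that claim (assuming a recent scipy; the index dtype is chosen by scipy, not by the data dtype):

```python
import numpy as np
from scipy import sparse

n = 100
arr = np.array([[0] * n + [1] * n], dtype=np.uint8)
M = sparse.csr_matrix(arr)

# The stored values inherit uint8, but the index arrays stay int32,
# so shrinking the data dtype only shrinks part of the memory cost.
print(M.data.dtype)                      # uint8
print(M.indices.dtype)                   # int32
print(M.data.nbytes, M.indices.nbytes)   # 100 vs 400 bytes
```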

The fact that your pandas version has half the memory usage suggests that it is just storing the indices, and somehow noting that the data part is all 1s. But that's just a guess.

How much greater sparsification do you expect?

====================

http://pandas.pydata.org/pandas-docs/stable/sparse.html

This example looks like pandas is implementing some sort of 'run' compression:

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.

A csr sparse equivalent (for a row array):

In [1052]: arr=np.random.rand(10)
In [1053]: arr[2:-2]=0
In [1055]: M=sparse.csr_matrix(arr)
In [1056]: M
Out[1056]: 
<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [1057]: M.data
Out[1057]: array([ 0.37875012,  0.73703368,  0.7935645 ,  0.22948213])
In [1058]: M.indices
Out[1058]: array([0, 1, 8, 9], dtype=int32)
In [1059]: M.indptr
Out[1059]: array([0, 4], dtype=int32)

The pandas version might be more compact if the fill values occur in blocks. But I suspect

0         1.0
1         1.0
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8         1.0
9         1.0

would produce the same blocks. I don't see evidence that it tries to identify the identical 1.0 values and store them as a value plus a count.

====================

Based on @MaxU's answer, your ds stores 1000 1's, plus two single-element arrays that tell it where those values are stored.

In [56]: sp.memory_usage()
Out[56]: 1080

In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])

As long as the nonfill values occur in big runs, the block arrays will be small. But if you scattered those 1000 values throughout the series, you'd multiply the number of blocks substantially:

 block locations: array([1,3,6,10,...])
 block lengths: array([1,1,1,2,1,...])
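In recent pandas, Series.to_sparse is gone, but pd.arrays.SparseArray exposes the same block machinery, which makes the effect of scattering easy to measure. A sketch, assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# 1000 ones in one contiguous run -> a single (location, length) block
a = np.zeros(n, dtype=int)
a[:1000] = 1
contiguous = pd.arrays.SparseArray(a, fill_value=0, kind='block')

# The same 1000 ones scattered at random -> roughly 1000 tiny blocks
b = np.zeros(n, dtype=int)
b[rng.choice(n, size=1000, replace=False)] = 1
scattered = pd.arrays.SparseArray(b, fill_value=0, kind='block')

# Both store exactly 1000 nonfill values; only the index overhead differs
print(contiguous.nbytes, scattered.nbytes)
```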

I can imagine a mapping between the csr layout and the pandas blocks, but haven't worked out the details. The csr layout is meant to work with 2d arrays, with a clear concept of rows and columns. It looks like a sparse dataframe just contains sparse series objects.
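One direction of that mapping is straightforward: expanding (location, length) block pairs into the explicit per-element indices that a csr matrix stores. Using the numbers from the sts example above:

```python
import numpy as np

# BlockIndex from the sts example: two blocks of length 2
block_locs = np.array([0, 8])
block_lengths = np.array([2, 2])

# Expand each block into its run of element positions
indices = np.concatenate(
    [np.arange(loc, loc + length) for loc, length in zip(block_locs, block_lengths)]
)
print(indices)   # [0 1 8 9] -- what M.indices holds in the csr version
```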

====================

https://stackoverflow.com/a/38157234/901925 shows how to map from sparse dataframe values to a scipy sparse matrix. For each column (data series) it uses sp_values, fill_value, and sp_index.

pandas/pandas/sparse/scipy_sparse.py has the code for interaction between scipy sparse and data series.

====================

kind='integer' produces a sparse structure more like scipy.sparse:

In [62]: n=5; s=pd.Series([0]*5+[1]*5, dtype=int)
In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
In [64]: ss
Out[64]: 
0    0
1    0
2    0
3    0
4    0
5    1
6    1
7    1
8    1
9    1
dtype: int32
IntIndex
Indices: array([5, 6, 7, 8, 9])

Contrast that with the default block kind:

dtype: int32
BlockIndex
Block locations: array([5])
Block lengths: array([5])

An equivalent column sparse matrix can be built with:

In [89]: data=ss.values
In [90]: data=ss.sp_values
In [91]: rows=ss.sp_index.indices
In [92]: cols=np.zeros_like(rows)
In [93]: sparse.csr_matrix((data,(rows,cols)))
Out[93]: 
<10x1 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>
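The same construction works with a current pandas, where pd.arrays.SparseArray replaces the removed to_sparse (a sketch; kind='integer' gives the IntIndex whose .indices feed straight into csr_matrix):

```python
import numpy as np
import pandas as pd
from scipy import sparse

ss = pd.arrays.SparseArray([0] * 5 + [1] * 5, fill_value=0, kind='integer')

data = ss.sp_values            # the five stored 1s
rows = ss.sp_index.indices     # IntIndex positions [5, 6, 7, 8, 9]
cols = np.zeros_like(rows)     # everything lands in column 0

M = sparse.csr_matrix((data, (rows, cols)), shape=(len(ss), 1))
print(M.toarray().ravel())     # [0 0 0 0 0 1 1 1 1 1]
```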

There is a to_coo method, but it only works with the more complex pd.MultiIndex object (why?).
