How can I "sparsify" on two values?
Question
Consider the pandas Series s:
n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)
s.memory_usage()
8080
I can use to_sparse:
s.to_sparse(fill_value=0).memory_usage()
4080
But I only have 2 distinct integer values. I'd think I could sparsify twice. Is there a way to do this?
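Since the stored values are all 0s and 1s, one further squeeze is a smaller value dtype. A sketch with the current SparseArray API (Series.to_sparse was removed in pandas 1.0; the uint8 choice here is my assumption, not something from the question):

```python
import numpy as np
import pandas as pd

n = 1000
dense = np.array([0] * n + [1] * n)

# fill_value=0 drops the zeros; only the n ones (plus a sparse index) are stored
sp64 = pd.arrays.SparseArray(dense, fill_value=0)
# Storing the same ones as 1-byte integers shrinks the sp_values array 8x
sp8 = pd.arrays.SparseArray(dense.astype(np.uint8), fill_value=0)
```

This only shrinks the stored values, not the index arrays, so the savings are bounded by the index storage.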
Answer
Since you tagged this with scipy, I'll show you what a scipy.sparse matrix looks like:
In [31]: n=100
In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]:
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([ 0, 100], dtype=int32)
It has replaced the 2n elements of arr with 2 arrays of n elements each. Even if I replace the int with uint8, the M.indices array will still be int32.
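That byte accounting can be checked directly. A sketch (int64 is chosen explicitly so the comparison doesn't depend on the platform's default int size):

```python
import numpy as np
from scipy import sparse

n = 100
arr = np.array([[0] * n + [1] * n], dtype=np.int64)
M = sparse.csr_matrix(arr)

# csr stores n data values, n column indices, and a tiny indptr array
stored = M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
dense = arr.nbytes  # the full 2n int64 values
```

The int64 data shrinks 2x (n values instead of 2n), but the int32 indices claw half of that back.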
The fact that your pandas version has half the memory usage suggests that it is just storing the indices, and somehow noting that the data part is all 1s. But that's just a guess.
How much more sparsification do you expect?
====================
http://pandas.pydata.org/pandas-docs/stable/sparse.html
This example looks like pandas is implementing some sort of 'run' compression:
In [4]: sts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.
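The same BlockIndex can be reproduced with the current SparseArray API (a sketch; to_sparse is gone in modern pandas, and kind='block' is assumed here):

```python
import numpy as np
import pandas as pd

# The values from the docs example; NaN is the default fill value for floats
vals = [0.469112, -0.282863] + [np.nan] * 6 + [-0.861849, -2.104569]
arr = pd.arrays.SparseArray(vals, kind='block')

arr.sp_index   # BlockIndex: Block locations [0, 8], Block lengths [2, 2]
arr.sp_values  # the 4 nonfill values, stored in a plain array
```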
A csr sparse equivalent (for a row array):
In [1052]: arr=np.random.rand(10)
In [1053]: arr[2:-2]=0
In [1055]: M=sparse.csr_matrix(arr)
In [1056]: M
Out[1056]:
<1x10 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [1057]: M.data
Out[1057]: array([ 0.37875012, 0.73703368, 0.7935645 , 0.22948213])
In [1058]: M.indices
Out[1058]: array([0, 1, 8, 9], dtype=int32)
In [1059]: M.indptr
Out[1059]: array([0, 4], dtype=int32)
The pandas version might be more compact if the fill values occur in blocks. But I suspect that
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 1.0
9 1.0
would produce the same blocks. I don't see evidence that it tries to identify the identical 1.0 values and store them as a value plus a count.
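That is easy to check: putting identical 1.0 values in the same positions still stores each one separately (a sketch under the same current-SparseArray assumption):

```python
import numpy as np
import pandas as pd

a = pd.arrays.SparseArray([1.0, 1.0] + [np.nan] * 6 + [1.0, 1.0], kind='block')
# Four values are stored, even though they are all the same 1.0 --
# no value-plus-count compression happens
a.sp_values
```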
====================
Based on @MaxU's answer, your sparse series stores 1000 1's, and two single-element arrays that tell it where those values are stored.
In [56]: sp.memory_usage()
Out[56]: 1080
In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])
As long as the nonfill values occur in big runs, the block arrays will be small. But if you scattered those 1000 values throughout the series, you'd multiply the number of blocks substantially:
block locations: array([1,3,6,10,...])
block lengths: array([1,1,1,2,1,...])
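A sketch of that effect, comparing one big run against the same count of scattered ones (current SparseArray API assumed; the random scatter is purely illustrative):

```python
import numpy as np
import pandas as pd

n = 1000
# One contiguous run of ones: a single (location, length) block
contig = pd.arrays.SparseArray([0] * n + [1] * n, fill_value=0, kind='block')

# The same 1000 ones scattered at random positions: many small blocks
rng = np.random.default_rng(0)
vals = np.zeros(2 * n, dtype=np.int64)
vals[rng.choice(2 * n, size=n, replace=False)] = 1
scattered = pd.arrays.SparseArray(vals, fill_value=0, kind='block')
```

Both store the same 1000 values, but the scattered version pays for many more block bookkeeping entries.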
I can imagine a mapping between the csr layout and the pandas blocks, but haven't worked out the details. The csr layout is meant to work with 2d arrays, with a clear concept of rows and columns. It looks like a sparse dataframe just contains sparse series objects.
====================
https://stackoverflow.com/a/38157234/901925 shows how to map from sparse dataframe values to a scipy sparse matrix. For each column (data series) it uses sp_values, fill_value, and sp_index.
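In current pandas those per-column pieces live on the .sparse accessor, and a frame of sparse columns converts in one call (a sketch; DataFrame.sparse.to_coo is the modern counterpart of that code):

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.arrays.SparseArray([0, 0, 1, 1], fill_value=0),
    'b': pd.arrays.SparseArray([1, 0, 0, 1], fill_value=0),
})

# Per-column pieces, as in the linked answer
vals = df['a'].sparse.sp_values   # stored (nonfill) values of column 'a'
fill = df['a'].sparse.fill_value  # 0

# Whole-frame conversion to a scipy COO matrix
coo = df.sparse.to_coo()          # one stored entry per nonfill value
```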
pandas/pandas/sparse/scipy_sparse.py has the code for the interaction between scipy sparse and data series.
====================
kind='integer' produces a sparse structure more like scipy.sparse:
In [62]: n=5; s=pd.Series([0]*5+[1]*5, dtype=int)
In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
In [64]: ss
Out[64]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 1
9 1
dtype: int32
IntIndex
Indices: array([5, 6, 7, 8, 9])
Contrast that with the default kind='block':
dtype: int32
BlockIndex
Block locations: array([5])
Block lengths: array([5])
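The same contrast with the current API, where kind is passed to SparseArray directly (a sketch):

```python
import pandas as pd

vals = [0] * 5 + [1] * 5
si = pd.arrays.SparseArray(vals, fill_value=0, kind='integer')
sb = pd.arrays.SparseArray(vals, fill_value=0, kind='block')

si.sp_index  # IntIndex: one stored position per nonfill value
sb.sp_index  # BlockIndex: one (location, length) pair per run
```

For a single long run the block form is much smaller; for isolated values the integer form is simpler and no larger.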
An equivalent column sparse matrix can be built with:
In [89]: data=ss.values
In [90]: data=ss.sp_values
In [91]: rows=ss.sp_index.indices
In [92]: cols=np.zeros_like(rows)
In [93]: sparse.csr_matrix((data,(rows,cols)))
Out[93]:
<10x1 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in Compressed Sparse Row format>
There is a to_coo method, but it only works with the more complex pd.MultiIndex object (why?).
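The MultiIndex is what supplies the (row, column) coordinates that a 2d scipy matrix needs. A single-column sketch (the level choices are my assumption):

```python
import pandas as pd

s = pd.Series([0] * 5 + [1] * 5, dtype='Sparse[int64]')  # fill_value defaults to 0
# Map the 10 positions onto 10 rows and 1 column via a MultiIndex
s.index = pd.MultiIndex.from_product([range(10), [0]])
coo, rows, cols = s.sparse.to_coo(row_levels=[0], column_levels=[1])
# coo is a scipy coo_matrix holding only the 5 stored ones
```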