Efficiently create sparse pivot tables in pandas?
Question
I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
Edit: here is an example of a non-sparse pivot:
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1
frame.pivot(index='person', columns='thing')
       count
thing      a    b    c    d
person
him      NaN    1  NaN    1
me         1  NaN  NaN    1
you        1  NaN    1  NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because values such as NaN or 0 can be left implicit instead of being stored. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
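To make the "pivot first, convert afterwards" route concrete, here is a minimal sketch that rebuilds the example frame, pivots it densely, and then converts the result to a SciPy CSR matrix. The variable names (`dense`, `dense_bytes`, `sparse_bytes`) are illustrative only. On a toy frame this small the byte counts are nearly identical; the gap only matters when most cells are empty.

```python
import pandas as pd
from scipy.sparse import csr_matrix

frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

# Dense route: every (person, thing) cell is materialised, missing ones as NaN.
dense = frame.pivot(index="person", columns="thing", values="count").fillna(0)
dense_bytes = dense.to_numpy().nbytes

# Converting afterwards keeps only the non-zero cells, but the dense
# intermediate has already been allocated by this point.
sparse = csr_matrix(dense.to_numpy())
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("dense cells:", dense.size, "stored sparse values:", sparse.nnz)
print("dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)
```

The dense intermediate is the cost the question is trying to avoid: its size grows with the number of persons times the number of things, regardless of how many records exist.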
Recommended answer
The answer posted previously by @khammel was useful, but unfortunately it no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
                           shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix, \
index=person_c.categories, \
columns=thing_c.categories, \
default_fill_value=0)
>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0
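Note that `pd.SparseDataFrame` was deprecated in pandas 0.25 and removed in pandas 1.0, so the last step above fails on current versions. A minimal sketch of the same step using `pd.DataFrame.sparse.from_spmatrix`, which builds a DataFrame with sparse columns from a scipy.sparse matrix, might look like this. It repeats the setup so it runs on its own, and the `.to_numpy()` calls are a defensive choice of mine rather than part of the original answer.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

# Ordered categoricals give stable integer codes for rows and columns.
person_c = CategoricalDtype(sorted(frame["person"].unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame["thing"].unique()), ordered=True)
row = frame["person"].astype(person_c).cat.codes.to_numpy()
col = frame["thing"].astype(thing_c).cat.codes.to_numpy()

sparse_matrix = csr_matrix(
    (frame["count"].to_numpy(), (row, col)),
    shape=(len(person_c.categories), len(thing_c.categories)),
)

# Replacement for pd.SparseDataFrame(...): a regular DataFrame whose
# columns use a sparse dtype.
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=person_c.categories,
    columns=thing_c.categories,
)
print(dfs)
```

If only the model input is needed, `sparse_matrix` can be passed on directly and the DataFrame wrapper skipped.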
The main changes are:
- .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
- sort() doesn't work anymore
Other changes are more superficial:
- using the category sizes instead of the length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
- the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
- pandas SparseDataFrame accepts a scipy.sparse object directly now
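The steps above can also be wrapped in a small reusable function. The sketch below is not part of the original answer: `sparse_pivot` is a hypothetical helper name, and it uses `pd.factorize(..., sort=True)` in place of the `CategoricalDtype` route to get sorted integer codes. It assumes the key columns contain no missing values, since `factorize` encodes those as -1.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def sparse_pivot(frame, index, columns, values):
    """Hypothetical helper: pivot long-format records straight into a CSR matrix.

    Returns the matrix plus the row and column labels, in sorted order.
    Assumes frame[index] and frame[columns] contain no missing values.
    """
    row_codes, row_labels = pd.factorize(frame[index], sort=True)
    col_codes, col_labels = pd.factorize(frame[columns], sort=True)
    matrix = csr_matrix(
        (frame[values].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    )
    return matrix, row_labels, col_labels


frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

matrix, persons, things = sparse_pivot(frame, "person", "thing", "count")
print(matrix.toarray())
print(list(persons), list(things))
```

One behavioural difference from `DataFrame.pivot` is worth knowing: if the same (person, thing) pair appears more than once, `pivot` raises an error, whereas building a CSR matrix from coordinate triplets sums the duplicate entries.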