Efficiently create sparse pivot tables in pandas?
Question
I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
Edit: here is an example of a non-sparse pivot:
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1
frame.pivot(index='person', columns='thing')
       count
thing      a    b    c    d
person
him      NaN    1  NaN    1
me         1  NaN  NaN    1
you        1  NaN    1  NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Recommended answer
The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
# Note: pd.SparseDataFrame was removed in pandas 1.0, so this call needs an older pandas.
dfs = pd.SparseDataFrame(sparse_matrix,
                         index=person_c.categories,
                         columns=thing_c.categories,
                         default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
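On pandas 1.0 and later, where pd.SparseDataFrame is no longer available, the labelled sparse frame can instead be built with pd.DataFrame.sparse.from_spmatrix. The following is a minimal sketch rather than part of the original answer; it rebuilds the same example data so that it runs on its own, and the .to_numpy() calls are an added precaution, not something the original code required.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Same example records as in the question.
frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

# Sorted categories give each label a stable integer code.
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes.to_numpy()
col = frame.thing.astype(thing_c).cat.codes.to_numpy()

sparse_matrix = csr_matrix(
    (frame["count"].to_numpy(), (row, col)),
    shape=(person_c.categories.size, thing_c.categories.size),
)

# Wrap the scipy matrix in a DataFrame with sparse columns and labels.
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=person_c.categories,
    columns=thing_c.categories,
)
print(dfs)
```

The resulting columns have a sparse dtype, so dfs.sparse.to_coo() should hand the data back to scipy without building a dense array first.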
The main changes were:
- .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
- sort() doesn't work anymore
Other changes were more superficial:
- using the category sizes instead of the length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
- the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
- pandas SparseDataFrame accepts a scipy.sparse object directly now
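For reuse, the steps above could be wrapped in a small function. This is only a sketch: sparse_pivot and its parameter names are hypothetical, not a pandas or scipy API. It relies on the fact that the csr_matrix constructor sums values that share the same (row, col) position, so repeated records are aggregated by addition.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix


def sparse_pivot(frame, index, columns, values):
    """Hypothetical helper: pivot long-format records into a CSR matrix.

    Returns the matrix together with the row and column labels.
    """
    row_c = CategoricalDtype(sorted(frame[index].unique()), ordered=True)
    col_c = CategoricalDtype(sorted(frame[columns].unique()), ordered=True)
    row = frame[index].astype(row_c).cat.codes.to_numpy()
    col = frame[columns].astype(col_c).cat.codes.to_numpy()
    matrix = csr_matrix(
        (frame[values].to_numpy(), (row, col)),
        shape=(row_c.categories.size, col_c.categories.size),
    )
    return matrix, row_c.categories, col_c.categories


# Illustrative data: ("me", "a") appears twice, with counts 1 and 2.
records = pd.DataFrame({
    "person": ["me", "you", "him", "me"],
    "thing": ["a", "a", "b", "a"],
    "count": [1, 1, 1, 2],
})
matrix, people, things = sparse_pivot(records, "person", "thing", "count")
print(list(people), list(things))
print(matrix.toarray())
```

If summing duplicates is not the behaviour you want, aggregate the records first (for example with groupby) before building the matrix.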