How to recover original indices for a flattened Numpy array?
Problem description
I've got a multidimensional numpy array that I'm trying to stick into a pandas data frame. I'd like to flatten the array, and create a pandas index that reflects the pre-flattened array indices.
Note that I'm using 3D to keep the example small, but I'd like to generalize to at least 4D.
A = np.random.rand(2,3,4)
array([[[ 0.43793885, 0.40078139, 0.48078691, 0.05334248],
[ 0.76331509, 0.82514441, 0.86169078, 0.86496111],
[ 0.75572665, 0.80860943, 0.79995337, 0.63123724]],
[[ 0.20648946, 0.57042315, 0.71777265, 0.34155005],
[ 0.30843717, 0.39381407, 0.12623462, 0.93481552],
[ 0.3267771 , 0.64097038, 0.30405215, 0.57726629]]])
df = pd.DataFrame(A.flatten())
I'm trying to generate x/y/z columns like this:
A z y x
0 0.437939 0 0 0
1 0.400781 0 0 1
2 0.480787 0 0 2
3 0.053342 0 0 3
4 0.763315 0 1 0
5 0.825144 0 1 1
6 0.861691 0 1 2
7 0.864961 0 1 3
...
21 0.640970 1 2 1
22 0.304052 1 2 2
23 0.577266 1 2 3
I've tried setting this up using np.meshgrid, but I'm going wrong somewhere:
dimnames = ['z', 'y', 'x']
ranges = [ np.arange(x) for x in A.shape ]
ix = [ x.flatten() for x in np.meshgrid(*ranges) ]
for name, col in zip(dimnames, ix):
    df[name] = col
df = df.set_index(dimnames).squeeze()
This result looks somewhat sensible, but the indices are wrong:
df
z y x
0 0 0 0.437939
1 0.400781
2 0.480787
3 0.053342
1 0 0 0.763315
1 0.825144
2 0.861691
3 0.864961
0 1 0 0.755727
1 0.808609
2 0.799953
3 0.631237
1 1 0 0.206489
1 0.570423
2 0.717773
3 0.341550
0 2 0 0.308437
1 0.393814
2 0.126235
3 0.934816
1 2 0 0.326777
1 0.640970
2 0.304052
3 0.577266
print(A[0,1,0])
0.76331508999999997
print(df.loc[0,1,0])
0.75572665000000006
How can I create the index columns to reflect the shape of A?
Recommended answer
You could use pd.MultiIndex.from_product:
import numpy as np
import pandas as pd
import string
def using_multiindex(A, columns):
    shape = A.shape
    # One level per axis; from_product iterates in the same C order as A.flatten().
    index = pd.MultiIndex.from_product([range(s) for s in shape], names=columns)
    df = pd.DataFrame({'A': A.flatten()}, index=index).reset_index()
    return df
A = np.array([[[ 0.43793885, 0.40078139, 0.48078691, 0.05334248],
[ 0.76331509, 0.82514441, 0.86169078, 0.86496111],
[ 0.75572665, 0.80860943, 0.79995337, 0.63123724]],
[[ 0.20648946, 0.57042315, 0.71777265, 0.34155005],
[ 0.30843717, 0.39381407, 0.12623462, 0.93481552],
[ 0.3267771 , 0.64097038, 0.30405215, 0.57726629]]])
df = using_multiindex(A, list('ZYX'))
yields
Z Y X A
0 0 0 0 0.437939
1 0 0 1 0.400781
2 0 0 2 0.480787
3 0 0 3 0.053342
...
21 1 2 1 0.640970
22 1 2 2 0.304052
23 1 2 3 0.577266
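As for why the np.meshgrid attempt in the question produced misaligned indices: np.meshgrid defaults to indexing='xy', which swaps the first two axes relative to the array's own axis order. Below is a minimal sketch of the question's approach with indexing='ij' instead. The array is random and the variable names follow the question, so treat it as an illustration rather than part of the original answer.

```python
import numpy as np
import pandas as pd

A = np.random.rand(2, 3, 4)
dimnames = ['z', 'y', 'x']

# One coordinate range per axis of A.
ranges = [np.arange(n) for n in A.shape]

# indexing='ij' keeps the grids in the same axis order as A, so each
# flattened grid lines up with A.flatten() (both are in C order).
grids = np.meshgrid(*ranges, indexing='ij')

df = pd.DataFrame({'A': A.flatten()})
for name, grid in zip(dimnames, grids):
    df[name] = grid.flatten()

# Index by (z, y, x) and compare one entry with direct array indexing.
series = df.set_index(dimnames)['A']
print(series.loc[(0, 1, 0)] == A[0, 1, 0])
```

With the default indexing='xy', the same lookup returns A[1, 0, 0]-style mismatches because the z and y grids are transposed before flattening.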
If performance is a top priority, consider using senderle's cartesian_product. (See the code below.)
Here is a benchmark for A with shape (100, 100, 100):
In [321]: %timeit using_cartesian_product(A, columns)
100 loops, best of 3: 13.8 ms per loop
In [318]: %timeit using_multiindex(A, columns)
10 loops, best of 3: 35.6 ms per loop
In [320]: %timeit indices_merged_arr_generic(A, columns)
10 loops, best of 3: 29.1 ms per loop
In [319]: %timeit using_product(A)
1 loop, best of 3: 461 ms per loop
This is the setup I used for the benchmark:
import numpy as np
import pandas as pd
import functools
import itertools as IT
import string
product = IT.product
def cartesian_product_broadcasted(*arrays):
    """
    http://stackoverflow.com/a/11146645/190597 (senderle)
    """
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    dtype = np.result_type(*arrays)
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), len(broadcasted)
    out = np.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    # Write each broadcast coordinate grid into its own contiguous chunk.
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T
def using_cartesian_product(A, columns):
    shape = A.shape
    coords = cartesian_product_broadcasted(*[np.arange(s, dtype='int') for s in shape])
    df = pd.DataFrame(coords, columns=columns)
    df['A'] = A.flatten()
    return df
def using_multiindex(A, columns):
    shape = A.shape
    index = pd.MultiIndex.from_product([range(s) for s in shape], names=columns)
    df = pd.DataFrame({'A': A.flatten()}, index=index).reset_index()
    return df
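def indices_merged_arr_generic(arr, columns):
    n = arr.ndim
    grid = np.ogrid[tuple(map(slice, arr.shape))]
    # Coordinates are stored with arr's dtype, so they come out as floats.
    out = np.empty(arr.shape + (n+1,), dtype=arr.dtype)
    for i in range(n):
        out[...,i] = grid[i]
    out[...,-1] = arr
    out.shape = (-1,n+1)
    # The index columns come first and the values last, matching the layout of out.
    df = pd.DataFrame(out, columns=columns+['A'])
    return df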
def using_product(A):
    # Hard-coded for 3D input.
    x, y, z = A.shape
    x_, y_, z_ = zip(*product(range(x), range(y), range(z)))
    df = pd.DataFrame(A.flatten()).assign(x=x_, y=y_, z=z_)
    return df
A = np.random.random((100,100,100))
shape = A.shape
columns = list(string.ascii_uppercase[-len(shape):][::-1])
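For comparison, NumPy can also recover the indices directly: np.unravel_index converts flat positions back into per-axis coordinates for a given shape. The sketch below is not one of the benchmarked functions. using_unravel_index is a hypothetical name, and its speed relative to the functions above has not been measured here. It should work for any number of dimensions and is shown with a 4D array.

```python
import numpy as np
import pandas as pd

def using_unravel_index(A, columns):
    # Hypothetical helper, not part of the benchmark above.
    # Flat positions 0..A.size-1 in C order, matching A.flatten().
    flat = np.arange(A.size)
    # unravel_index returns one coordinate array per axis of A.
    coords = np.unravel_index(flat, A.shape)
    df = pd.DataFrame(dict(zip(columns, coords)))
    df['A'] = A.flatten()
    return df

A4 = np.random.random((2, 3, 4, 5))
df4 = using_unravel_index(A4, list('WZYX'))

# Spot-check one row against direct indexing into the array.
row = df4.iloc[37]
w, z, y, x = (int(row[c]) for c in 'WZYX')
print(row['A'] == A4[w, z, y, x])
```

The same function covers the 3D case by passing a 3D array and three column names.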