How to recover original indices for a flattened Numpy array?

Question

I've got a multidimensional numpy array that I'm trying to stick into a pandas data frame. I'd like to flatten the array, and create a pandas index that reflects the pre-flattened array indices.

Note that I'm using 3D to keep the example small, but I'd like to generalize this to at least 4D.

A = np.random.rand(2,3,4)
array([[[ 0.43793885,  0.40078139,  0.48078691,  0.05334248],
    [ 0.76331509,  0.82514441,  0.86169078,  0.86496111],
    [ 0.75572665,  0.80860943,  0.79995337,  0.63123724]],

   [[ 0.20648946,  0.57042315,  0.71777265,  0.34155005],
    [ 0.30843717,  0.39381407,  0.12623462,  0.93481552],
    [ 0.3267771 ,  0.64097038,  0.30405215,  0.57726629]]])

df = pd.DataFrame(A.flatten())

I'm trying to generate x/y/z columns like this:

           A  z  y  x
0   0.437939  0  0  0
1   0.400781  0  0  1
2   0.480787  0  0  2
3   0.053342  0  0  3
4   0.763315  0  1  0
5   0.825144  0  1  1
6   0.861691  0  1  2
7   0.864961  0  1  3
...
21  0.640970  1  2  1
22  0.304052  1  2  2
23  0.577266  1  2  3

I've tried setting this up using np.meshgrid but I'm going wrong somewhere:

dimnames = ['z', 'y', 'x']
ranges   = [ np.arange(x) for x in A.shape ]
ix       = [ x.flatten()  for x in np.meshgrid(*ranges) ]
for name, col in zip(dimnames, ix):
    df[name] = col
df = df.set_index(dimnames).squeeze()

This result looks somewhat sensible, but the indices are wrong:

df
z  y  x
0  0  0    0.437939
      1    0.400781
      2    0.480787
      3    0.053342
1  0  0    0.763315
      1    0.825144
      2    0.861691
      3    0.864961
0  1  0    0.755727
      1    0.808609
      2    0.799953
      3    0.631237
1  1  0    0.206489
      1    0.570423
      2    0.717773
      3    0.341550
0  2  0    0.308437
      1    0.393814
      2    0.126235
      3    0.934816
1  2  0    0.326777
      1    0.640970
      2    0.304052
      3    0.577266

print(A[0,1,0])
0.76331508999999997

print(df.loc[0,1,0])
0.75572665000000006

How can I create the index columns to reflect the shape of A?

Recommended answer

You could use pd.MultiIndex.from_product:

import numpy as np
import pandas as pd
import string

def using_multiindex(A, columns):
    shape = A.shape
    index = pd.MultiIndex.from_product([range(s) for s in shape], names=columns)
    df = pd.DataFrame({'A': A.flatten()}, index=index).reset_index()
    return df

A = np.array([[[ 0.43793885,  0.40078139,  0.48078691,  0.05334248],
    [ 0.76331509,  0.82514441,  0.86169078,  0.86496111],
    [ 0.75572665,  0.80860943,  0.79995337,  0.63123724]],

   [[ 0.20648946,  0.57042315,  0.71777265,  0.34155005],
    [ 0.30843717,  0.39381407,  0.12623462,  0.93481552],
    [ 0.3267771 ,  0.64097038,  0.30405215,  0.57726629]]])

df = using_multiindex(A, list('ZYX'))

yields

    Z  Y  X         A
0   0  0  0  0.437939
1   0  0  1  0.400781
2   0  0  2  0.480787
3   0  0  3  0.053342
...
21  1  2  1  0.640970
22  1  2  2  0.304052
23  1  2  3  0.577266
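
As for why the np.meshgrid attempt in the question produced shuffled indices: np.meshgrid defaults to indexing='xy', which swaps the first two axes relative to the array's own layout. Passing indexing='ij' keeps the grids in array-axis order. The following is a minimal sketch of that fix, assuming a small np.arange array as a stand-in for the question's A:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's array (assumption: any 3D array works here).
A = np.arange(24, dtype=float).reshape(2, 3, 4)
dimnames = ['z', 'y', 'x']

ranges = [np.arange(n) for n in A.shape]
# indexing='ij' makes the i-th grid vary along the i-th axis of A,
# so flattening each grid follows the same order as A.flatten().
grids = np.meshgrid(*ranges, indexing='ij')

df = pd.DataFrame({'A': A.flatten()})
for name, grid in zip(dimnames, grids):
    df[name] = grid.flatten()
s = df.set_index(dimnames)['A']

# The lookup through the index now agrees with direct array indexing.
print(s.loc[0, 1, 0] == A[0, 1, 0])
```

With the default indexing='xy', the same lookup returns the element at A[1, 0, 0]-style swapped positions, which is the mismatch shown in the question.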

Or if performance is a top priority, consider using senderle's cartesian_product. (See the code, below.)

Here is a benchmark for A with shape (100, 100, 100):

In [321]: %timeit  using_cartesian_product(A, columns)
100 loops, best of 3: 13.8 ms per loop

In [318]: %timeit using_multiindex(A, columns)
10 loops, best of 3: 35.6 ms per loop

In [320]: %timeit indices_merged_arr_generic(A, columns)
10 loops, best of 3: 29.1 ms per loop

In [319]: %timeit using_product(A)
1 loop, best of 3: 461 ms per loop
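
Another way to express the same idea, shown here only as a sketch, is np.unravel_index, which converts flat positions back into per-axis coordinates. The helper name using_unravel_index is hypothetical and was not part of the timings above, so no performance claim is made for it:

```python
import numpy as np
import pandas as pd

def using_unravel_index(A, columns):
    # Hypothetical helper, not one of the benchmarked functions.
    # unravel_index maps flat positions 0..A.size-1 to one coordinate
    # array per axis, in C order, which matches A.flatten().
    coords = np.unravel_index(np.arange(A.size), A.shape)
    df = pd.DataFrame(dict(zip(columns, coords)))
    df['A'] = A.flatten()
    return df

# Small stand-in array so the result is easy to inspect.
A = np.arange(24, dtype=float).reshape(2, 3, 4)
df = using_unravel_index(A, list('ZYX'))
print(df.head())
```

Because the shape is passed through as-is, the same function should work for 4D and higher without changes.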

This is the setup I used for the benchmark:

import numpy as np
import pandas as pd
import functools
import itertools as IT
import string
product = IT.product

def cartesian_product_broadcasted(*arrays):
    """
    http://stackoverflow.com/a/11146645/190597 (senderle)
    """
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    dtype = np.result_type(*arrays)
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), len(broadcasted)
    out = np.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

def using_cartesian_product(A, columns):
    shape = A.shape
    coords = cartesian_product_broadcasted(*[np.arange(s, dtype='int') for s in shape])
    df = pd.DataFrame(coords, columns=columns)
    df['A'] = A.flatten()
    return df

def using_multiindex(A, columns):
    shape = A.shape
    index = pd.MultiIndex.from_product([range(s) for s in shape], names=columns)
    df = pd.DataFrame({'A': A.flatten()}, index=index).reset_index()
    return df

def indices_merged_arr_generic(arr, columns):
    n = arr.ndim
    grid = np.ogrid[tuple(map(slice, arr.shape))]
    out = np.empty(arr.shape + (n+1,), dtype=arr.dtype)
    for i in range(n):
        out[...,i] = grid[i]
    out[...,-1] = arr
    out.shape = (-1,n+1)
    # The index columns were written first and the values last, so label them in that order.
    df = pd.DataFrame(out, columns=columns+['A'])
    return df

def using_product(A):
    x, y, z = A.shape
    x_, y_, z_ = zip(*product(range(x), range(y), range(z)))
    df = pd.DataFrame(A.flatten()).assign(x=x_, y=y_, z=z_)
    return df

A = np.random.random((100,100,100))
shape = A.shape
columns = list(string.ascii_uppercase[-len(shape):][::-1])
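
Since the question asks for something that generalizes to at least 4D, here is a small sanity check, sketched under the assumption that a random 4D array and the axis names W, Z, Y, X are acceptable stand-ins. It repeats using_multiindex from above so the block runs on its own, then uses the recovered coordinates to look every value up in the original array:

```python
import numpy as np
import pandas as pd

def using_multiindex(A, columns):
    # Same approach as in the answer: one index level per array axis.
    index = pd.MultiIndex.from_product([range(s) for s in A.shape], names=columns)
    return pd.DataFrame({'A': A.flatten()}, index=index).reset_index()

rng = np.random.default_rng(0)
A4 = rng.random((2, 3, 4, 5))   # illustrative 4D array
columns = list('WZYX')          # illustrative axis names

df4 = using_multiindex(A4, columns)

# Index the original array with the recovered coordinate columns and
# compare against the flattened values stored in the frame.
coords = tuple(df4[c].to_numpy() for c in columns)
roundtrip_ok = np.array_equal(A4[coords], df4['A'].to_numpy())
print(df4.shape, roundtrip_ok)
```

The same round-trip check can be pointed at any of the other functions in the benchmark setup to confirm that their coordinate columns line up with the values.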
