How can I construct a pandas DataFrame from individual 1D NumPy arrays without copying?

Question

Unlike every other question I can find, I do not want to create a DataFrame from a homogeneous NumPy array, nor do I want to convert a structured array into a DataFrame.

What I want is to create a DataFrame in which each column comes from its own 1D NumPy array. I tried the obvious DataFrame({"col1": nparray1, "col2": nparray2}), but this call shows up at the top of my profile, so it must be doing something really slow.

As I understand it, pandas DataFrames are implemented in pure Python, with each column backed by a NumPy array, so I would expect there to be an efficient way to do this.

What I'm actually trying to do is fill a DataFrame efficiently from Cython. Cython has memoryviews that allow efficient access to NumPy arrays, so my strategy is to allocate a NumPy array, fill it with data, and then put it in a DataFrame.

The opposite direction, creating a memoryview from a pandas DataFrame, works fine. So if there is a way to preallocate the entire DataFrame and then just pass its columns to Cython, that would also be acceptable.

cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

A section of my code's profile looks like this; everything else the code does is completely dwarfed by creating the DataFrame at the end.

         1100546 function calls (1086282 primitive calls) in 4.345 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.345    4.345 profile:0(<code object <module> at 0x7f4e693d1c90, file "test.py", line 1>)
    445/1    0.029    0.000    4.344    4.344 :0(exec)
        1    0.006    0.006    4.344    4.344 test.py:1(<module>)
     1000    0.029    0.000    2.678    0.003 :0(run_df)
     1001    0.017    0.000    2.551    0.003 frame.py:378(__init__)
     1001    0.018    0.000    2.522    0.003 construction.py:170(init_dict)

The corresponding code:

def run_df(self, df):
    cdef int arx_rows = len(df)
    cdef int arx_idx

    cdef int32_t[:] data_in = df['data_in'].to_numpy(dtype="int32")

    data_out_np = np.zeros(arx_rows, dtype="int32")
    cdef int32_t[:] data_out = data_out_np

    for arx_idx in range(arx_rows):
        self.cpp_sec_par.run(data_in[arx_idx], data_out[arx_idx])

    return pd.DataFrame({
        'data_out': data_out_np,
    })

Recommended answer

I don't think this fully answers the question, but it might help.

1. When you initialize a DataFrame directly from a 2D array, no copy is made.

2. You don't have a 2D array, though; you have 1D arrays. I don't know how to get a 2D array from 1D arrays without making copies.
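For the 1D case, one thing that may be worth trying is the `copy` parameter of the `DataFrame` constructor. The sketch below is not part of the original answer. It assumes a reasonably recent pandas (1.3 or later is an assumption here), uses made-up example arrays `a` and `b`, and checks with `np.shares_memory` whether the columns still point at the input buffers rather than taking that for granted.

```python
import numpy as np
import pandas as pd

# Two independent 1D arrays with different dtypes (hypothetical example data).
a = np.array([1, 2, 3], dtype="int32")
b = np.array([4.0, 5.0, 6.0], dtype="float64")

# copy=False asks pandas not to copy the inputs. Whether it can honor
# that depends on the pandas version and on the dtypes involved.
df = pd.DataFrame({"a": a, "b": b}, copy=False)

# Verify instead of assuming: does each column share memory with its source?
for name, source in (("a", a), ("b", b)):
    shared = np.shares_memory(df[name].to_numpy(), source)
    print(f"column {name!r} shares memory with its source array: {shared}")
```

If the check prints `False` on your installation, the constructor copied the data after all, and the 2D approach discussed in this answer may be the better route.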

To illustrate these points, see below:

import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array((a, b))  # builds a new 2D array from a and b
df = pd.DataFrame(c)

print(c)
[[1 2 3]
 [4 5 6]]

print(df)
   0  1  2
0  1  2  3
1  4  5  6

c[1,1]=10
print(df)
   0   1  2
0  1   2  3
1  4  10  6

So changing c does indeed change df. However, changing a or b does not affect c (or df), because np.array((a, b)) copies their contents into a new array.
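One possible way around this, sketched here as an assumption rather than a confirmed solution, is to reverse the order: allocate the 2D array first and use 1D views into it as the per-column buffers. The names `block`, `col_x`, and `col_y` are hypothetical, and passing `copy=False` explicitly is a precaution because the constructor's default has differed between pandas versions. Note that all columns of one 2D array share a single dtype, so this only helps when the columns are homogeneous.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array((a, b))           # stacks a and b into a new buffer
print(np.shares_memory(a, c))  # False: c holds copies of a and b

# Alternative: allocate the 2D array first, then take 1D views into it.
n_rows = 3
# Fortran order keeps each column contiguous in memory.
block = np.zeros((n_rows, 2), dtype="int32", order="F")
col_x = block[:, 0]  # 1D view into block, no copy
col_y = block[:, 1]

# Fill the views in place (in Cython these could back typed memoryviews).
col_x[:] = [1, 2, 3]
col_y[:] = [4, 5, 6]

# Wrap the already-filled block; copy=False asks pandas not to copy it.
df = pd.DataFrame(block, columns=["x", "y"], copy=False)
print(df)
print(np.shares_memory(df.to_numpy(), block))
```

The last line reports whether the DataFrame still refers to the memory of `block` on your pandas version, which is the property the question is after.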
