Vectorize a Pandas DataFrame into NumPy Arrays

Question

I need to convert a pandas DataFrame into a list of arrays, where each array is a list of single-element lists.

Example:

import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])

I know there is the as_matrix() function (removed in pandas 1.0; df.values or df.to_numpy() give the same array on current versions), which returns the following:

df.as_matrix():
# result:array([[1, 2, 3],
                [2, 2, 4],
                [3, 2, 4]])

However, I need the output in this format:

  [array([[1], [2], [3]]),
   array([[2], [2], [4]]),
   array([[3], [2], [4]])]

That is, I need a list of arrays, one per row of the dataframe, where each array is a list of lists and each innermost list holds a single element. In effect, every row of the dataframe becomes a column vector of dimension 3.

This is useful when I need to do matrix / vector operations in NumPy. My data source is currently a .csv file, and I am struggling to find a way to convert the resulting dataframe into vectors.

Any help would be greatly appreciated.

Recommended answer

Extract the underlying array data, add a new axis as the last dimension, and then split along the first axis with np.vsplit -

np.vsplit(df.values[...,None],df.shape[0])

Sample run -

In [327]: df
Out[327]: 
   0  1  2
0  1  2  3
1  2  2  4
2  3  2  4

In [328]: expected_output = [np.array([[1], [2], [3]]),
     ...: np.array([[2], [2], [4]]),
     ...: np.array([[3], [2], [4]])]

In [329]: expected_output
Out[329]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

In [330]: np.vsplit(df.values[...,None],df.shape[0])
Out[330]: 
[array([[[1],
         [2],
         [3]]]), array([[[2],
         [2],
         [4]]]), array([[[3],
         [2],
         [4]]])]

If you are working with NumPy functions, then in most scenarios you should be able to skip the splitting altogether and use the new-axis-extended array directly.
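As a rough illustration of using the extended array without splitting, the sketch below applies one matrix to every row vector in a single call. The matrix M is a hypothetical example that is not part of the original question, and df.to_numpy() is used here as the current equivalent of df.values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Shape (3, 3, 1): one (3, 1) column vector per dataframe row.
vectors = df.to_numpy()[..., None]

# Hypothetical 3x3 matrix (an assumption for illustration only).
M = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 3]])

# matmul broadcasts M over the leading axis, so every row vector is
# transformed at once without a Python-level split or loop.
transformed = M @ vectors

print(vectors.shape)           # (3, 3, 1)
print(transformed.shape)       # (3, 3, 1)
print(transformed[0].ravel())  # [1 4 9]
```

Indexing transformed[k] then gives the (3, 1) result for row k, which is the same shape as one element of the list asked for in the question.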

Under the hood, np.vsplit makes use of np.array_split, which is basically a loop over the sections. So a slightly faster way is to avoid the extra function-call overhead and call np.array_split directly, like so -

np.array_split(df.values[...,None],df.shape[0])
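A small check, assuming the same example dataframe, suggests that the two calls produce identical pieces and shows the shape of each piece. It uses df.to_numpy() as the current equivalent of df.values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])
extended = df.to_numpy()[..., None]  # shape (3, 3, 1)

via_vsplit = np.vsplit(extended, df.shape[0])
via_array_split = np.array_split(extended, df.shape[0])  # axis=0 by default

# Compare the two lists piece by piece.
same = all(np.array_equal(a, b) for a, b in zip(via_vsplit, via_array_split))
shapes = [piece.shape for piece in via_array_split]

print(same)    # True
print(shapes)  # [(1, 3, 1), (1, 3, 1), (1, 3, 1)]
```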

Note that each piece from the split has one more dimension than the expected output: shape (1, 3, 1) rather than (3, 1). If you want the squeezed version, you can use a list comprehension over the new-axis-extended array, like so -

In [357]: [i for i in df.values[...,None]]
Out[357]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

Thus, another way would be to add the new axis inside the loop -

[i[...,None] for i in df.values]
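To confirm that both list-comprehension variants match the format requested in the question, here is a minimal sketch that compares them against the expected output. The variable names are illustrative, and df.to_numpy() stands in for df.values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

expected = [np.array([[1], [2], [3]]),
            np.array([[2], [2], [4]]),
            np.array([[3], [2], [4]])]

# Variant 1: add the axis once, then iterate over the first axis.
by_iterating = [v for v in df.to_numpy()[..., None]]

# Variant 2: iterate over the rows and add the axis to each row.
by_adding_axis = [row[..., None] for row in df.to_numpy()]

for name, result in [("iterate", by_iterating), ("add axis", by_adding_axis)]:
    matches = all(np.array_equal(r, e) for r, e in zip(result, expected))
    print(name, [r.shape for r in result], matches)
```

Each element has shape (3, 1), so no extra squeeze step is needed with either variant.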
