How to pass a large number of dataframe columns to numpy vectorize as arguments


Question

I have a dataframe with exactly 31 columns and, for example, 100 rows.

I need to create a list of 100 dictionaries whose values are computed from the 31 columns.

I am currently using the apply() function to do this:

my_df.apply(lambda row: _build_data(row, param1, param2, param3), axis=1)

Now I want to explore what np.vectorize() can do. The problem is that, from what I have read, I would have to pass each column to it as a separate argument:

np.vectorize(_build_data)(my_df[col1], my_df[col2], ..., my_df[col31], param1, param2, param3)

This does not look Pythonic, and I do not want to define a function with 34 arguments.

Is there another way to do this?

Thank you very much for your help!

Recommended answer

I suspect you are trying to use np.vectorize because you read that numpy 'vectorization' is a way of speeding up pandas code. Consider a small example dataframe:

In [29]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])                  
In [30]: df                                                                                    
Out[30]: 
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

The slow, row-by-row approach to taking the row mean:

In [31]: df.apply(lambda row: np.mean(row), axis=1)                                            
Out[31]: 
0     1.0
1     4.0
2     7.0
3    10.0
dtype: float64

The fast numpy method:

In [32]: df.to_numpy()                                                                         
Out[32]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [33]: df.to_numpy().mean(axis=1)                                                            
Out[33]: array([ 1.,  4.,  7., 10.])

That is, we get an array of the dataframe values and use a fast compiled method to calculate the row means.
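As a quick check, the two approaches can be compared directly. This is a minimal sketch that rebuilds the same small example frame; the names `slow`, `fast`, and `same` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# Row by row: the Python lambda is called for each row.
slow = df.apply(lambda row: np.mean(row), axis=1)

# Whole array at once: a single call into compiled numpy code.
fast = df.to_numpy().mean(axis=1)

# Both approaches should give the same row means.
same = np.allclose(slow.to_numpy(), fast)
print(same)
```

For this particular calculation pandas also has its own `df.mean(axis=1)`, so dropping down to numpy is not strictly required.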

But to make something like a dictionary for each row:

In [35]: df.apply(lambda row: {str(k):k for k in row}, axis=1)                                 
Out[35]: 
0        {'0': 0, '1': 1, '2': 2}
1        {'3': 3, '4': 4, '5': 5}
2        {'6': 6, '7': 7, '8': 8}
3    {'9': 9, '10': 10, '11': 11}
dtype: object

We have to iterate over the array rows, just as apply iterates over the dataframe rows:

In [36]: [{str(k):k for k in row} for row in df.to_numpy()]                                    
Out[36]: 
[{'0': 0, '1': 1, '2': 2},
 {'3': 3, '4': 4, '5': 5},
 {'6': 6, '7': 7, '8': 8},
 {'9': 9, '10': 10, '11': 11}]

The array approach is faster:

In [37]: timeit df.apply(lambda row: {str(k):k for k in row}, axis=1)                          
1.13 ms ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [38]: timeit [{str(k):k for k in row} for row in df.to_numpy()]                             
40.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

But the apply method returns a pandas Series (see Out[35]), not a list. I suspect most of the extra time goes into that step.
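Applied to the question's case, the same row iteration might look like the sketch below. `build_data` and its two parameters are hypothetical stand-ins for the asker's `_build_data`, whose body is not shown; the point is only that each row arrives as one dictionary, so no 34-argument signature is needed.

```python
import numpy as np
import pandas as pd

def build_data(row, param1, param2):
    # Hypothetical stand-in for the question's _build_data:
    # scale every value by param1 and tag the result with param2.
    result = {"label": param2}
    for name, value in row.items():
        result[name] = value * param1
    return result

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# Pair each array row with the column names, then call the function per row.
columns = list(df.columns)
records = [
    build_data(dict(zip(columns, row)), 10, "demo")
    for row in df.to_numpy()
]
print(records[0])
```

If the function only needs plain per-row dictionaries, `df.to_dict(orient="records")` produces them directly.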

np.vectorize (and np.frompyfunc) can also be used to iterate over an array, but by default they iterate over elements, not rows or columns. In general they are slower than the more explicit iteration (as in [36]).
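If you still want to try np.vectorize, star-unpacking avoids typing out every column. The sketch below assumes a function that accepts `*values`; `combine` and `as_dict` are hypothetical names, not part of the original question. Each call receives one scalar per column, which is the element-wise iteration described above.

```python
import numpy as np
import pandas as pd

def combine(*values):
    # Hypothetical row function: receives one scalar per column.
    return sum(values)

def as_dict(*values):
    # Hypothetical row function that returns a dictionary.
    return {str(v): v for v in values}

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# One array per column, passed as separate arguments via star-unpacking.
column_arrays = [df[name].to_numpy() for name in df.columns]
totals = np.vectorize(combine)(*column_arrays)

# otypes=[object] tells np.vectorize to store arbitrary Python objects.
dicts = np.vectorize(as_dict, otypes=[object])(*column_arrays)
print(totals)
print(dicts[0])
```

This keeps the call site short, but it is still a Python-level loop, so it should not be expected to beat the list comprehension.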

A clumsy way of making a dataframe from the list:

In [53]: %%timeit 
    ...: df1 = pd.DataFrame(['one','two','three','four'],columns=['d'])   
    ...: df1['d'] =[{str(k):k for k in row} for row in df.to_numpy()]                                                                                       
572 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
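A somewhat tidier alternative, offered only as a sketch and not timed here, is to wrap the list in a Series that reuses the original index and convert that to a one-column dataframe. The column name `d` follows the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])
records = [{str(k): k for k in row} for row in df.to_numpy()]

# Wrap the list directly in a Series that shares the original index.
as_series = pd.Series(records, index=df.index, name="d")

# to_frame() turns the Series into a one-column dataframe.
df1 = as_series.to_frame()
print(df1)
```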
