Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Question

Edit to add: I don't think the numba benchmarks are fair, notes below

I'm trying to benchmark different approaches to numerically processing data for the following use case:

  1. Fairly large dataset (more than 100,000 records)
  2. 100+ lines of fairly simple code (z = x + y)
  3. No need for sorting or indexing

In other words, the full generality of series and dataframes is not needed, although they are included here b/c they are still convenient ways to encapsulate the data and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.

Question: Based on this use case, are the following benchmarks appropriate and if not, how can I improve them?

import numpy as np
from pandas import Series, DataFrame
from numba import jit
nobs = 10000
nlines = 100

def proc_df():
   df = DataFrame({ 'x': np.random.randn(nobs),
                    'y': np.random.randn(nobs) })
   for i in range(nlines):
      df['z'] = df.x + df.y
   return df.z

def proc_ser():
   x = Series(np.random.randn(nobs))
   y = Series(np.random.randn(nobs))
   for i in range(nlines):
      z = x + y
   return z

def proc_arr():
   x = np.random.randn(nobs)
   y = np.random.randn(nobs)
   for i in range(nlines):
      z = x + y
   return z

@jit
def proc_numba():
   xx = np.random.randn(nobs)
   yy = np.random.randn(nobs)
   zz = np.zeros(nobs)
   for j in range(nobs):
      x, y = xx[j], yy[j]
      for i in range(nlines):
         z = x + y
      zz[j] = z
   return zz

Results (Win 7, 3 year old Xeon workstation (quad-core). Standard and recent anaconda distribution or very close.)

In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop

In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop

In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop

In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop   # may not be valid result (see note below)

Edit to add (in response to Jeff): alternate results from passing the df/series/array into the functions rather than creating them inside the functions (i.e. moving the code lines containing 'randn' from inside each function to outside it):

10 loops, best of 3: 45.1 ms per loop       # proc_df
100 loops, best of 3: 15.1 ms per loop      # proc_ser
1000 loops, best of 3: 1.07 ms per loop     # proc_arr
100000 loops, best of 3: 17.9 µs per loop   # proc_numba, may not be valid result (see note below)

Note on numba results: I think the numba compiler must be optimizing the for loop and reducing it to a single iteration. I don't know that for certain, but it's the only explanation I can come up with, since it couldn't be 50x faster than numpy, right? Followup question here: Why is numba faster than numpy here?
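One way to probe that guess, offered as a sketch rather than a statement about what numba actually does: in proc_numba the inner loop recomputes the same x + y on every pass, so its result equals a single addition and an optimizing compiler is free to skip the repeats. The hypothetical proc_numba_dependent below makes each pass depend on the previous one, so the repeated work cannot be dropped without changing the answer. Both function names, the array size, and the fallback decorator (used only when numba is not installed) are assumptions for illustration and are not part of the original post.

```python
import numpy as np

try:
    from numba import njit  # compiled path, if numba is installed
except ImportError:
    def njit(func):  # fallback: run as plain Python so the sketch still executes
        return func

nlines = 100

@njit
def proc_numba_invariant(xx, yy, nlines):
    # Same shape as proc_numba: the inner loop recomputes an identical value.
    zz = np.zeros(xx.shape[0])
    for j in range(xx.shape[0]):
        z = 0.0
        for i in range(nlines):
            z = xx[j] + yy[j]
        zz[j] = z
    return zz

@njit
def proc_numba_dependent(xx, yy, nlines):
    # Each pass uses the previous z, so no pass is redundant.
    zz = np.zeros(xx.shape[0])
    for j in range(xx.shape[0]):
        z = 0.0
        for i in range(nlines):
            z = z + xx[j] + yy[j]
        zz[j] = z
    return zz

rng = np.random.RandomState(0)
xx = rng.randn(1000)
yy = rng.randn(1000)

invariant = proc_numba_invariant(xx, yy, nlines)  # equals xx + yy
dependent = proc_numba_dependent(xx, yy, nlines)  # roughly nlines * (xx + yy)
```

If the invariant version's timing barely changes as nlines grows while the dependent version's timing scales with it, that would support the single-iteration explanation. Call each function once before timing so that compilation is not included in the measurement.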

Recommended answer

Well, you are not really timing the same things here (or rather, you are timing different aspects).

For example:

In [6]:    x = Series(np.random.randn(nobs))

In [7]:    y = Series(np.random.randn(nobs))

In [8]:  %timeit x + y
10000 loops, best of 3: 131 µs per loop

In [9]:  %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop

So [8] times the actual operation, while [9] includes the overhead of the series creation (and the random number generation) plus the actual operation.
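The same distinction can be reproduced outside IPython with the standard timeit module. This is a minimal sketch; the helper names add_only and create_and_add and the repeat count are assumptions chosen for illustration.

```python
import timeit

import numpy as np
from pandas import Series

nobs = 10000

# Build the inputs once, outside the timed statement.
x = Series(np.random.randn(nobs))
y = Series(np.random.randn(nobs))

def add_only():
    # Times just the operation, as in [8].
    return x + y

def create_and_add():
    # Times series creation and random number generation plus the operation, as in [9].
    return Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))

repeats = 200
t_add = timeit.timeit(add_only, number=repeats) / repeats
t_all = timeit.timeit(create_and_add, number=repeats) / repeats

print("add only:       %.1f us per call" % (t_add * 1e6))
print("create and add: %.1f us per call" % (t_all * 1e6))
```

The absolute numbers will depend on the machine and library versions, so only the gap between the two lines is meaningful.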

Another example is proc_ser vs proc_df. proc_df includes the overhead of assigning a particular column in the DataFrame (which is actually different for the initial creation and for subsequent reassignment).

So create the structure (you can time that too, but that is a separate issue). Perform the exact same operation and time them.
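A like-for-like harness along those lines might look like the sketch below. Every structure is built up front from the same data, and only the addition (and, in one case, the column assignment) is timed. The names candidates, time_call and df_add_assign are hypothetical, and pre-creating the z column is an assumption made so that the timed assignment is always a reassignment.

```python
import timeit

import numpy as np
from pandas import DataFrame, Series

nobs = 10000

# Create every structure up front from the same random data.
x_arr = np.random.randn(nobs)
y_arr = np.random.randn(nobs)
x_ser, y_ser = Series(x_arr), Series(y_arr)
df = DataFrame({'x': x_arr, 'y': y_arr})
df['z'] = 0.0  # create the column first so the timed step is a reassignment

def df_add_assign():
    # Addition plus column assignment, as in proc_df.
    df['z'] = df.x + df.y

candidates = {
    'array add': lambda: x_arr + y_arr,
    'series add': lambda: x_ser + y_ser,
    'df columns add': lambda: df.x + df.y,
    'df add + assign': df_add_assign,
}

def time_call(func, number=200):
    func()  # warm-up call, not timed
    return timeit.timeit(func, number=number) / number

timings = {name: time_call(func) for name, func in candidates.items()}
for name, seconds in timings.items():
    print("%-16s %8.1f us per call" % (name, seconds * 1e6))
```

Comparing 'df columns add' with 'df add + assign' isolates the cost of the column assignment from the cost of the addition itself.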

Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if they are already aligned), while in numba you need to 'manually' align them.
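To make the alignment point concrete, here is a small sketch with made-up values: two Series carry the same labels in opposite order, so the pandas result differs from the positional result you get from the underlying arrays.

```python
import numpy as np
from pandas import Series

x = Series([1.0, 2.0, 3.0], index=[0, 1, 2])
y = Series([10.0, 20.0, 30.0], index=[2, 1, 0])  # same labels, reversed order

aligned = x + y                    # pandas matches values by index label
positional = x.values + y.values   # numpy adds element by element, by position

# By label: 0 -> 1 + 30, 1 -> 2 + 20, 2 -> 3 + 10
print(aligned.loc[0], aligned.loc[1], aligned.loc[2])
# By position: 1 + 10, 2 + 20, 3 + 30
print(positional)
```

Code that works on raw arrays (numpy or numba) gets the positional behaviour, so any reordering has to be done explicitly before the arrays are passed in.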
