Apply function to sets of columns in pandas, 'looping' over entire data frame column-wise

Problem description

Here is a test example to show what I am trying to achieve. Here's a toy data frame:

headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10, 7), index=range(1, 11), columns=headers)

which gives

    Time       A_x       A_y       A_z       B_x       B_y       B_z
1  -0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
2  -0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
3  -0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
4   0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
5   1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
6   1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
7   0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
8  -1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
9  -0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
10  0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863

What I want to do is calculate, for each index, the length of the vector for each header (A and B in this case) and divide it by the Time column. Hence the function needs to be np.sqrt(A_x**2 + A_y**2 + A_z**2), and the same for B. In other words, I am looking to calculate the velocity for each row, where three columns contribute to one velocity result.

I have tried using df.groupby and df.filter to loop over the columns, but I cannot get it to work, because I am not sure how to apply the same function to chunks of the data frame all in one go (apparently one should avoid looping over rows). I have tried doing

df = df.apply(lambda x: np.sqrt(x.dot(x)), axis=1)

This works, but only if the input data frame has exactly three columns. If it has more, the dot product is calculated over the entire row rather than in chunks of three columns, which is what I want (each chunk corresponds to one tag's coordinates, which are three-dimensional).

So this is what I am eventually trying to get with the above example (the arrays below are just filled with random numbers, not the actual velocities I am trying to calculate; they only show the shape I am trying to achieve):

     Velocity_A  Velocity_B
1    -0.975633   -2.669544
2     0.766405   -0.264904
3     0.425481   -0.429894
4    -0.437316    0.954006
5     1.073352   -1.475964
6    -0.647534    0.937035
7     0.082517    0.438112
8    -0.387111   -1.417930
9    -0.111011    1.068530
10    0.451979   -0.053333

My actual data is 50,000 x 36 (so there are 12 tags with x, y, z coordinates), and I want to calculate the velocities all in one go to avoid iterating, if at all possible. There is also a time column of the same length (50,000 x 1).

How would you do this?

Thanks, Astrid

Recommended answer

Your calculation is more NumPy-ish than pandas-ish. By that I mean it can be expressed fairly succinctly if you treat your DataFrame as just a big array, whereas the solution (at least the one I came up with) is more complicated when you try to wrangle the DataFrame with melt, groupby, etc.

The entire calculation can be expressed in essentially one line:

np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]

So here is the NumPy way:

import io

import numpy as np
import pandas as pd

content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
-0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
 0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
 1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
 1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
 0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
-1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

# content is a str, so wrap it in StringIO; the first non-blank line is used as the header
df = pd.read_csv(io.StringIO(content), sep=r'\s+')

arr = df.to_numpy()
times = arr[:, 0]   # the Time column
arr = arr[:, 1:]    # the coordinate columns, ordered x, y, z for each tag
# Square, split each row into (tag, coordinate) blocks of 3, sum per tag, take the root
result = np.sqrt((arr**2).reshape(arr.shape[0], -1, 3).sum(axis=-1)) / times[:, None]
result = pd.DataFrame(result, columns=['Velocity_%s' % x for x in 'AB'])
print(result)

yields

   Velocity_A  Velocity_B
0   -9.555311  -22.467965
1   -5.568487   -7.177625
2   -9.086257  -12.030091
3    2.007230    1.144208
4    1.824531    0.775006
5    1.472305    2.623467
6    1.954044    3.967796
7   -0.485576   -1.384815
8   -7.736036   -6.722931
9    1.392823    5.369757
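
To make the reshape step easier to follow, here is a minimal sketch that runs the same one-liner stage by stage. The input values are hypothetical, chosen so the vector lengths are easy to check by hand; the intermediate names (squared, blocks, lengths) are illustrative only. It assumes, as the code above does, that each tag's x, y, z columns are adjacent and in the same order.

```python
import numpy as np

# Hypothetical toy input: 2 rows, 2 tags with (x, y, z) each
arr = np.array([[3.0, 4.0, 0.0, 1.0, 2.0, 2.0],
                [0.0, 0.0, 5.0, 6.0, 8.0, 0.0]])
times = np.array([1.0, 2.0])

squared = arr ** 2                              # shape (2, 6)
blocks = squared.reshape(arr.shape[0], -1, 3)   # shape (2, 2, 3): row, tag, coordinate
lengths = np.sqrt(blocks.sum(axis=-1))          # shape (2, 2): one length per row and tag
velocity = lengths / times[:, None]             # times[:, None] has shape (2, 1) and broadcasts

print(velocity)  # rows are [5, 3] and [2.5, 5]
```

The -1 in reshape lets NumPy infer the number of tags, so the same expression should work for 12 tags as long as the number of coordinate columns is a multiple of 3.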


Since your actual DataFrame has shape (50000, 36), choosing a quick method may be important. Here is a benchmark:

import string

import numpy as np
import pandas as pd

N = 12
col_ids = string.ascii_letters[:N]  # string.letters exists only in Python 2
df = pd.DataFrame(
    np.random.randn(50000, 3*N+1),
    columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids
                      for coord in list('xyz')])


def using_numpy(df):
    # Treat the frame as one array and reshape the coordinates into blocks of 3
    arr = df.to_numpy()
    times = arr[:, 0]
    arr = arr[:, 1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0], -1, 3).sum(axis=-1)) / times[:, None]
    result = pd.DataFrame(result, columns=['Velocity_%s' % x for x in col_ids])
    return result

def using_loop(df):
    # Loop over the tags (not the rows), selecting each tag's columns by prefix
    results = pd.DataFrame(index=df.index)  # the result container
    for col_id in col_ids:
        results['Velocity_'+col_id] = np.sqrt((df.filter(regex='^'+col_id+'_')**2).sum(axis=1)) / df['Time']
    return results

Using IPython:

In [43]: %timeit using_numpy(df)
10 loops, best of 3: 34.7 ms per loop

In [44]: %timeit using_loop(df)
10 loops, best of 3: 82 ms per loop
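
For comparison, a more pandas-flavoured route can be sketched by grouping the coordinate columns on their tag prefix instead of relying on column order. This is an illustrative sketch, not part of the benchmark above and not timed here. The toy frame is hypothetical, and it assumes every coordinate column is named tag_coordinate with the tag before the first underscore.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column layout as the question
df = pd.DataFrame({
    'Time': [1.0, 2.0],
    'A_x': [3.0, 0.0], 'A_y': [4.0, 0.0], 'A_z': [0.0, 5.0],
    'B_x': [1.0, 6.0], 'B_y': [2.0, 8.0], 'B_z': [2.0, 0.0],
})

coords = df.drop(columns='Time')
# Transpose so the column labels become the index, group by the tag prefix
# (the part before '_'), sum the squares per tag, then transpose back.
sq_sums = (coords ** 2).T.groupby(lambda label: label.split('_')[0]).sum().T
# Divide each row by its Time value and label the result columns
velocity = np.sqrt(sq_sums).div(df['Time'], axis=0).add_prefix('Velocity_')

print(velocity)
```

Because the grouping uses labels, this version does not depend on the x, y, z columns of a tag being adjacent. The cost is two transposes, so on a 50,000-row frame it may well be slower than the reshape approach.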
