Customizing rolling_apply function in Python pandas


Question


Setup

I have a DataFrame with three columns:

  • "Category" contains True and False, and I have done df.groupby('Category') to group by these values.
  • "Time" contains timestamps (measured in seconds) at which values have been recorded.
  • "Value" contains the values themselves.

At each time instance, two values are recorded: one has category "True", and the other has category "False".

Rolling apply question

Within each category group, I want to compute a number and store it in column Result for each time. Result is the percentage of values between time t-60 and t that fall between 1 and 3.

The easiest way to accomplish this is probably to calculate the total number of values in that time interval via rolling_count, then execute rolling_apply to count only the values from that interval that fall between 1 and 3.

Here is my code so far:

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
    grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds

What is the proper rolling_apply() call to find grp['in_interval']?

Solution

Let's work through an example:

import pandas as pd
import numpy as np
np.random.seed(1)

def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True]*N + [False]*N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a,b))
        })
    return df

df = setup(regular=False)
df.sort_values(['Category', 'Time'], inplace=True)

So the DataFrame, df, looks like this:

In [4]: df
Out[4]: 
   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
2      True   0.013725      2
5      True  11.080631      5
4      True  17.610707      4
6      True  22.351225      6
3      True  36.279909      3
7      True  41.467287      7
8      True  47.612097      8
0      True  50.042641      0
9      True  64.658008      9
1      True  86.438939      1

Now, copying @herrfz, let's define

def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage

between(1,3) is a function which takes a Series as input and returns the fraction of its elements which lie in the half-open interval [1,3). For example,

In [9]: series = pd.Series([1,2,3,4,5])

In [10]: between(1,3)(series)
Out[10]: 0.4
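
To make the half-open boundary concrete, here is a small sketch that repeats the closure above and compares it with the mean of the equivalent boolean mask. The names frac_closure, frac_mask and edge are only illustrative and do not appear in the original answer.

```python
import pandas as pd


def between(a, b):
    # Same closure as above: fraction of elements in the half-open interval [a, b).
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


series = pd.Series([1, 2, 3, 4, 5])

# Two of the five elements (1 and 2) lie in [1, 3).
frac_closure = between(1, 3)(series)

# The mean of a boolean mask is the fraction of True entries, so it gives the same number.
frac_mask = ((1 <= series) & (series < 3)).mean()

# The lower bound is included and the upper bound is excluded.
edge = between(1, 3)(pd.Series([1, 3]))
```

Here edge is 0.5 because 1 counts and 3 does not.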

Now we are going to take our DataFrame, df, and group by Category:

df.groupby(['Category'])

For each group in the groupby object, we will want to apply a function:

df['Result'] = df.groupby(['Category']).apply(toeach_category)

The function, toeach_category, will take a (sub)DataFrame as input, and return a DataFrame as output. The entire result will be assigned to a new column of df called Result.

Now what exactly must toeach_category do? If we write toeach_category like this:

def toeach_category(subf):
    print(subf)

then we see each subf is a DataFrame such as this one (when Category is False):

   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
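
The same split can be inspected without writing a function by iterating over the groupby object. The sketch below uses a tiny made-up frame (not the data above) and a hypothetical dictionary group_index to record which row labels land in each group. The point is that every sub-DataFrame keeps the index labels of the original rows, which is what lets a per-group result be aligned back onto df.

```python
import pandas as pd

# Tiny illustrative frame; the numbers are invented for this sketch.
df = pd.DataFrame({
    'Category': [True, True, False, False],
    'Time': [0.0, 30.0, 0.0, 30.0],
    'Value': [2, 5, 2, 5],
})

group_index = {}
for key, subf in df.groupby('Category'):
    # key is the group label; subf holds that group's rows with their original index.
    group_index[bool(key)] = list(subf.index)
```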

We want to take the Time column, and for each time, apply a function. That's done with applymap:

def toeach_category(subf):
    result = subf[['Time']].applymap(percentage)

The function percentage will take a time value t as input and return a single value as output: the fraction of rows in the 60-second window ending at t whose Value lies between 1 and 3. applymap is very strict: percentage cannot take any other arguments.
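
For readers on newer pandas: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so a portable way to apply a one-argument function to every entry of a single column is Series.apply. The sketch below assumes a made-up frame and a hypothetical function to_minutes; it only illustrates the element-wise calling convention, not the final computation.

```python
import pandas as pd

# Invented data for illustration only.
subf = pd.DataFrame({'Time': [0.0, 30.0, 90.0], 'Value': [2, 5, 1]})


def to_minutes(t):
    # Hypothetical one-argument function: it receives a single Time value.
    return t / 60.0


# Series.apply calls to_minutes once per element and keeps the original index.
minutes = subf['Time'].apply(to_minutes)
```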

Given a time t, we can select the Values from subf whose times are in the half-open interval (t-60, t] using the loc indexer:

subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']

And so we can find the percentage of those Values between 1 and 3 by applying between(1,3):

between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
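
As a worked instance of this selection, the sketch below rebuilds one category's Time and Value columns from the table above and evaluates the window for t = 64.658008. It is written with loc, which accepts a boolean row mask plus a column label. The variable names window and fraction are illustrative.

```python
import pandas as pd

# Time and Value for one category, copied from the table above.
subf = pd.DataFrame({
    'Time': [0.013725, 11.080631, 17.610707, 22.351225, 36.279909,
             41.467287, 47.612097, 50.042641, 64.658008, 86.438939],
    'Value': [2, 5, 4, 6, 3, 7, 8, 0, 9, 1],
})

t = 64.658008

# Rows whose Time lies in the half-open interval (t - 60, t].
window = subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value']

# Fraction of the selected Values that lie in [1, 3).
fraction = ((1 <= window) & (window < 3)).mean()
```

The first row (Time 0.013725) falls outside the window, and none of the remaining eight Values lies in [1, 3), so the fraction is 0.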

Now remember that we want a function percentage which takes t as input and returns the above expression as output:

def percentage(t):
    return between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

But notice that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

So how do we get out of this jam? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name like subf is first looked for in the Local scope, then the Enclosing scope, then the Global scope, and lastly in the Builtin scope. When percentage(t) is called, and Python encounters subf, Python first looks in the Local scope for the value of subf. Since subf is not a local variable in percentage, Python looks for it in the Enclosing scope of the function toeach_category. It finds subf there. Perfect. That is just what we need.
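
The scoping rule can be seen in isolation with plain Python. In the sketch below, make_scaler, scale and scale_with are hypothetical names chosen for illustration. The inner function takes one argument yet still uses factor from the enclosing scope, just as percentage uses subf. functools.partial is shown as another way to bind an extra argument up front.

```python
from functools import partial


def make_scaler(factor):
    def scale(x):
        # factor is not local to scale, so Python finds it in the enclosing scope.
        return x * factor
    return scale


double = make_scaler(2)
triple = make_scaler(3)
results = [double(5), triple(5)]


def scale_with(x, factor):
    # Two-argument version with no closure.
    return x * factor


# partial fixes factor=2, leaving a one-argument callable.
double_partial = partial(scale_with, factor=2)
```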

So now we have our function toeach_category:

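def toeach_category(subf):
    def percentage(t):
        # subf is found in the enclosing scope of toeach_category
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    # call percentage once for every entry of the Time column
    result = subf[['Time']].applymap(percentage)
    return result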
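
For comparison, here is a sketch of the same function for recent pandas releases. It assumes Series.apply in place of applymap and writes each group's result back with loc inside an explicit loop, which avoids depending on how groupby.apply stitches results together in a given version. The data is one category's Time and Value columns from the table above, duplicated for both categories. This is one possible arrangement, not the original answer's code.

```python
import pandas as pd


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


def toeach_category(subf):
    def percentage(t):
        # Rows of this group whose Time lies in (t - 60, t].
        in_window = (t - 60 < subf['Time']) & (subf['Time'] <= t)
        return between(1, 3)(subf.loc[in_window, 'Value'])
    # Returns a Series indexed like subf.
    return subf['Time'].apply(percentage)


times = [0.013725, 11.080631, 17.610707, 22.351225, 36.279909,
         41.467287, 47.612097, 50.042641, 64.658008, 86.438939]
values = [2, 5, 4, 6, 3, 7, 8, 0, 9, 1]

df = pd.DataFrame({
    'Category': [True] * 10 + [False] * 10,
    'Time': times * 2,
    'Value': values * 2,
})
df = df.sort_values(['Category', 'Time'])

df['Result'] = float('nan')
for _, subf in df.groupby('Category'):
    # Assignment aligns on the index labels that subf shares with df.
    df.loc[subf.index, 'Result'] = toeach_category(subf)
```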


Putting it all together,

import pandas as pd
import numpy as np
np.random.seed(1)


def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True] * N + [False] * N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a, b))
    })
    return df


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


def toeach_category(subf):
    def percentage(t):
        # subf is found in the enclosing scope of toeach_category
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    # call percentage once for every entry of the Time column
    result = subf[['Time']].applymap(percentage)
    return result


df = setup(regular=False)
df.sort_values(['Category', 'Time'], inplace=True)
df['Result'] = df.groupby(['Category']).apply(toeach_category)
print(df)

yields

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.200000
17    False  41.467287      7  0.166667
18    False  47.612097      8  0.142857
10    False  50.042641      0  0.125000
19    False  64.658008      9  0.000000
11    False  86.438939      1  0.166667
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.200000
7      True  41.467287      7  0.166667
8      True  47.612097      8  0.142857
0      True  50.042641      0  0.125000
9      True  64.658008      9  0.000000
1      True  86.438939      1  0.166667
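
The question asked for a rolling-window formulation. pd.rolling_count and pd.rolling_apply are no longer available in current pandas, but a time-based rolling window can express the same idea. The sketch below is an alternative to the answer above, not part of it. It assumes the Time column can be converted to a TimedeltaIndex and relies on the default right-closed window for offset-based rolling, so each window covers (t - 60s, t]. The data is one category's Time and Value columns from the table above.

```python
import pandas as pd

times = [0.013725, 11.080631, 17.610707, 22.351225, 36.279909,
         41.467287, 47.612097, 50.042641, 64.658008, 86.438939]
values = [2, 5, 4, 6, 3, 7, 8, 0, 9, 1]

# Index the values by elapsed time so that rolling() can use a time offset.
s = pd.Series(values, index=pd.to_timedelta(times, unit='s'))

# 1.0 where the value lies in [1, 3), otherwise 0.0.
in_range = ((1 <= s) & (s < 3)).astype(float)

# The mean of the indicator over the last 60 seconds is the wanted fraction.
result = in_range.rolling('60s').mean()
```

Up to floating-point rounding, result should agree with the Result column printed above. The index has to be sorted by time for an offset window to be accepted.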
