Determine the mean value of 'data' over the highest number of CONTINUOUS cond=True

Question

I have a pandas DataFrame with a 'data' column and a 'cond' (condition) column. I need the mean of the 'data' column over the rows that form the longest run of CONTINUOUS True values in 'cond'.

    Example DataFrame:

        cond  data
    0   True  0.20
    1  False  0.30
    2   True  0.90
    3   True  1.20
    4   True  2.30
    5  False  0.75
    6   True  0.80

    Result = 1.467 (rounded), the mean of 'data' over rows 2 to 4, the longest run with 3 consecutive True values

I was not able to find a "vectorized" solution with a groupby or pivot method, so I wrote a function that loops over the rows. Unfortunately, this takes about an hour for 1 million rows, which is way too long, and the @jit decorator does not reduce the duration measurably.

The data I want to analyze comes from a year-long monitoring project that produces a DataFrame with one million rows every 3 hours, so there are about 3000 such files.

An efficient solution is very important. I would also be very grateful for a solution in NumPy.

Recommended answer

Here's a NumPy-based approach -

import numpy as np

# Extract the relevant cond column as a 1D NumPy array and pad it with False
# at both ends, so that every run of True values has both a start
# (rising edge) and a stop (falling edge), even at the array boundaries
arr = np.concatenate(([False],df.cond.values,[False]))

# Determine the rising and falling edges as start and stop 
start = np.nonzero(arr[1:] > arr[:-1])[0]
stop = np.nonzero(arr[1:] < arr[:-1])[0]

# Get the interval lengths and determine the largest interval ID
maxID = (stop - start).argmax()

# With maxID get max interval range and thus get mean on the second col
out = df.data.iloc[start[maxID]:stop[maxID]].mean()
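To make the edge detection concrete, here is a minimal sketch that runs the same steps on the example DataFrame from the question. The intermediate values in the comments were traced by hand for this small input; the variable names `lengths` and `max_id` are only illustrative.

```python
import numpy as np
import pandas as pd

# The example frame from the question
df = pd.DataFrame({
    "cond": [True, False, True, True, True, False, True],
    "data": [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80],
})

# Pad with False so runs touching either boundary still have both edges
arr = np.concatenate(([False], df["cond"].values, [False]))

start = np.nonzero(arr[1:] > arr[:-1])[0]   # first row of each True run: [0, 2, 6]
stop = np.nonzero(arr[1:] < arr[:-1])[0]    # one past the last row:      [1, 5, 7]
lengths = stop - start                      # run lengths:                [1, 3, 1]
max_id = lengths.argmax()                   # index of the longest run:   1

out = df["data"].iloc[start[max_id]:stop[max_id]].mean()
print(round(out, 4))  # 1.4667
```

Because `stop` is already one past the last True row, it can be used directly as the exclusive end of the `iloc` slice.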

Runtime test

Approaches as functions -

def pandas_based(df): # @ayhan's soln
    res = df['data'].groupby((df['cond'] != df['cond'].shift()).\
                                cumsum()).agg(['count', 'mean'])
    return res[res['count'] == res['count'].max()]

def numpy_based(df):
    arr = np.concatenate(([False],df.cond.values,[False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    maxID = (stop - start).argmax()
    return df.data.iloc[start[maxID]:stop[maxID]].mean()
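One edge case worth noting: if 'cond' contains no True at all, `start` and `stop` are empty and `argmax()` raises a ValueError. The following is a hedged variant, not part of the original answer, that guards against this and works on plain NumPy arrays. The function name `longest_true_run_mean` and the choice to return NaN for the empty case are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def longest_true_run_mean(df):
    """Mean of 'data' over the longest run of True in 'cond'.

    Returns NaN when 'cond' has no True value (an assumed convention).
    On ties, the first of the longest runs is used.
    """
    cond = df["cond"].to_numpy(dtype=bool)
    data = df["data"].to_numpy(dtype=float)

    # Pad with False, then difference: +1 is a rising edge, -1 a falling edge
    padded = np.concatenate(([False], cond, [False]))
    edges = np.diff(padded.astype(np.int8))
    start = np.flatnonzero(edges == 1)    # first index of each True run
    stop = np.flatnonzero(edges == -1)    # one past the last index of each run

    if start.size == 0:                   # no True values at all
        return float("nan")

    k = (stop - start).argmax()           # argmax returns the first maximum
    return data[start[k]:stop[k]].mean()
```

Slicing the NumPy array instead of calling `iloc` avoids pandas indexing overhead for the final step, though whether that matters in practice would need to be measured on the real data.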

Timings -

In [208]: # Setup dataframe
     ...: N = 1000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [209]: %timeit pandas_based(df)
100 loops, best of 3: 2.61 ms per loop

In [210]: %timeit numpy_based(df)
1000 loops, best of 3: 215 µs per loop

In [211]: # Setup dataframe
     ...: N = 10000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [212]: %timeit pandas_based(df)
100 loops, best of 3: 4.12 ms per loop

In [213]: %timeit numpy_based(df)
1000 loops, best of 3: 331 µs per loop
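The `%timeit` lines above are IPython magics. As a rough sketch of how the same comparison could be reproduced in a plain Python script, the standard-library `timeit` module can be used. The helpers `make_frame` and `best_time` are hypothetical names introduced here, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

def pandas_based(df):
    # Label each run of equal 'cond' values, then aggregate per run
    run_id = (df["cond"] != df["cond"].shift()).cumsum()
    res = df["data"].groupby(run_id).agg(["count", "mean"])
    return res[res["count"] == res["count"].max()]

def numpy_based(df):
    arr = np.concatenate(([False], df["cond"].values, [False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    max_id = (stop - start).argmax()
    return df["data"].iloc[start[max_id]:stop[max_id]].mean()

def make_frame(n, seed=0):
    """Random test frame with roughly 70% True values (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({"data": rng.random(n)})
    df["cond"] = rng.random(n) > 0.3
    return df

def best_time(func, df, number=10, repeat=3):
    """Best observed per-call time in seconds (hypothetical helper)."""
    times = timeit.repeat(lambda: func(df), number=number, repeat=repeat)
    return min(times) / number

for n in (1_000, 10_000):
    frame = make_frame(n)
    print(n, best_time(pandas_based, frame), best_time(numpy_based, frame))
```

Raising `n` to 1_000_000 would match the file size described in the question.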
