Pandas Cumulative Sum using Current Row as Condition

Problem description

I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:

  • the start time is less than or equal to this row's start time
  • the end time is greater than this row's start time

So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.

I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.

Recommended answer

You can try the code below to get the final result.

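import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])

# For each row, count the rows that started at or before this row's start
# and end after it (the row itself is included in the count).
active_events = {}
for i in df.index:
    current_start = df.loc[i, "start"]
    active_events[i] = ((df["start"] <= current_start) & (df["end"] > current_start)).sum()
last_columns = pd.DataFrame({"No. active events": pd.Series(active_events)})

# join returns a new DataFrame, so keep the result
result = df.join(last_columns)
print(result)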
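
The loop above compares every row against the whole table, so its cost grows quadratically, which may be too slow for the roughly 2 million records mentioned in the question. A possible alternative, offered here as a sketch and not as part of the original answer, is to sort the start and end times once and use `np.searchsorted` to count how many events have started and how many have already ended at each row's start time. The helper name `count_active` is hypothetical, and the sketch assumes every record has `end >= start`, which the question does not state explicitly.

```python
import numpy as np
import pandas as pd


def count_active(df, start_col="start", end_col="end"):
    """Count, for each row, the rows active at that row's start time.

    Assumption (not stated in the question): end >= start in every row,
    so any event that has ended by a given time has also started by then.
    """
    starts = df[start_col].to_numpy()
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(df[end_col].to_numpy())
    # Number of rows whose start is <= each row's start.
    started = np.searchsorted(sorted_starts, starts, side="right")
    # Number of rows whose end is <= each row's start (no longer active).
    ended = np.searchsorted(sorted_ends, starts, side="right")
    return pd.Series(started - ended, index=df.index, name="No. active events")


# Same sample data as in the answer above.
df = pd.DataFrame({"start": [2, 5, 3, 6], "end": [10, 8, 8, 9]})
result = df.join(count_active(df))
print(result)
```

On the sample data this gives the same counts as the loop. Sorting dominates the cost, so the work is on the order of n log n instead of n squared.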
