Determine change in values in a grouped dataframe


Problem Description

Assume a dataset like this (which originally is read in from a .csv):

import pandas as pd

data = pd.DataFrame({'id': [1,2,3,1,2,3],
                     'time':['2017-01-01 12:00:00','2017-01-01 12:00:00','2017-01-01 12:00:00',
                          '2017-01-01 12:10:00','2017-01-01 12:10:00','2017-01-01 12:10:00'],
                     'value': [10,11,12,10,12,13]})

=>

    id  time                    value
0   1   2017-01-01 12:00:00     10
1   2   2017-01-01 12:00:00     11
2   3   2017-01-01 12:00:00     12
3   1   2017-01-01 12:10:00     10
4   2   2017-01-01 12:10:00     12
5   3   2017-01-01 12:10:00     13

Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.

I want the total number of changes in the value column, per id, between consecutive times. For example: for id=1 there is no change (result: 0), for id=2 there is one change (result: 1). Inspired by this post, I have tried taking differences: Determining when a column value changes in pandas dataframe

This is what I've come up with so far (it does not work as expected):

data = data.set_index(['id', 'time']) # MultiIndex 
grouped = data.groupby(level='id') 
data['diff'] = grouped['value'].diff()
data.loc[data['diff'].notnull(), 'diff'] = 1
data.loc[data['diff'].isnull(), 'diff'] = 0
grouped['diff'].sum()

However, this just yields the number of occurrences per id (minus one): every non-null diff is set to 1 regardless of whether the value actually changed, so the sum counts rows rather than changes.

Since my dataset is huge (and won't fit into memory), the solution should be as fast as possible. (This is why I use a MultiIndex on id + time; I expect a significant speedup because, ideally, the data then no longer needs to be shuffled.)

Moreover, I have come across dask dataframes, which are very similar to pandas DataFrames. A solution making use of them would be fantastic.

Solution

Do you want something like this?

data.groupby('id').value.apply(lambda x: len(set(x)) - 1)

You get

id
1    0
2    1
3    1

Edit: As @COLDSPEED mentioned, if changes back to a previously seen value also need to be counted, use

data.groupby('id').value.apply(lambda x: (x != x.shift()).sum() - 1)
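
For large frames, a fully vectorized variant may also be worth trying, since it avoids running a Python lambda per group. The following is only a sketch of an alternative, not part of the accepted answer; it assumes id and time are regular columns, as in the answer above:

# Sketch of a vectorized alternative (assumption: id and time are columns, not the index).
data = data.sort_values(['id', 'time'])            # chronological order within each id
prev = data.groupby('id')['value'].shift()         # previous value within the same id
changed = data['value'].ne(prev) & prev.notna()    # True where the value changed
changes_per_id = changed.groupby(data['id']).sum()

On the sample data this should give the same result as the shift-based lambda (0, 1, 1).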

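Since the question explicitly mentions dask, here is a rough sketch of how the same per-id change count might be expressed with dask.dataframe. The file name data.csv is hypothetical, and the meta argument (which describes the expected output) may need adjusting; this is an illustration of the API shape rather than a tuned solution:

import dask.dataframe as dd

ddf = dd.read_csv('data.csv')   # hypothetical CSV with id, time, and value columns

def count_changes(g):
    # dask does not guarantee row order after the groupby shuffle, so sort by time first
    v = g.sort_values('time')['value']
    return (v != v.shift()).sum() - 1

changes = ddf.groupby('id').apply(count_changes, meta=('value', 'int64'))
print(changes.compute())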