Determine change in values in a grouped dataframe


Problem Description

Assume a dataset like this (which originally is read in from a .csv):

import pandas as pd

data = pd.DataFrame({'id': [1,2,3,1,2,3],
                     'time':['2017-01-01 12:00:00','2017-01-01 12:00:00','2017-01-01 12:00:00',
                          '2017-01-01 12:10:00','2017-01-01 12:10:00','2017-01-01 12:10:00'],
                     'value': [10,11,12,10,12,13]})

=>

    id  time                    value
0   1   2017-01-01 12:00:00     10
1   2   2017-01-01 12:00:00     11
2   3   2017-01-01 12:00:00     12
3   1   2017-01-01 12:10:00     10
4   2   2017-01-01 12:10:00     12
5   3   2017-01-01 12:10:00     13

Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.

I want the total number of changes in the value column, per id, between consecutive times. For example: for id=1 there is no change (result: 0), for id=2 there is one change (result: 1). Inspired by this post, I have tried taking differences: Determining when a column value changes in pandas dataframe

This is what I've come up with so far (it does not work as expected):

data = data.set_index(['id', 'time']) # MultiIndex 
grouped = data.groupby(level='id') 
data['diff'] = grouped['value'].diff()
data.loc[data['diff'].notnull(), 'diff'] = 1
data.loc[data['diff'].isnull(), 'diff'] = 0
grouped['diff'].sum()

However, this just yields the number of occurrences per id (minus one): every non-null diff is set to 1 regardless of whether the value actually changed, so the sum counts rows rather than changes.

Since my dataset is huge (and won't fit into memory), the solution should be as fast as possible. (This is why I use a MultiIndex on id + time; I expect a significant speedup because, ideally, the data then no longer needs to be shuffled.)

Moreover, I have come across dask dataframes, which are very similar to pandas DataFrames. A solution making use of them would be fantastic.

Solution

Do you want something like this?

data.groupby('id').value.apply(lambda x: len(set(x)) - 1)

You get

id
1    0
2    1
3    1

Edit: As @COLDSPEED mentioned, if changes back to a previously seen value also need to be counted, use

data.groupby('id').value.apply(lambda x: (x != x.shift()).sum() - 1)
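
For large frames, a fully vectorized variant may also be worth trying, since it avoids running a Python lambda per group. The following is only a sketch of an alternative, not part of the accepted answer; it assumes id and time are regular columns, as in the answer above:

# Sketch of a vectorized alternative (assumption: id and time are columns, not the index).
data = data.sort_values(['id', 'time'])            # chronological order within each id
prev = data.groupby('id')['value'].shift()         # previous value within the same id
changed = data['value'].ne(prev) & prev.notna()    # True where the value changed
changes_per_id = changed.groupby(data['id']).sum()

On the sample data this should give the same result as the shift-based lambda (0, 1, 1).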

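Since the question explicitly mentions dask, here is a rough sketch of how the same per-id change count might be expressed with dask.dataframe. The file name data.csv is hypothetical, and the meta argument (which describes the expected output) may need adjusting; this is an illustration of the API shape rather than a tuned solution:

import dask.dataframe as dd

ddf = dd.read_csv('data.csv')   # hypothetical CSV with id, time, and value columns

def count_changes(g):
    # dask does not guarantee row order after the groupby shuffle, so sort by time first
    v = g.sort_values('time')['value']
    return (v != v.shift()).sum() - 1

changes = ddf.groupby('id').apply(count_changes, meta=('value', 'int64'))
print(changes.compute())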