Split a DataFrame by rows containing zero in python pandas

Question

Apologies if what I'm asking is very basic and has been answered elsewhere (I couldn't find it, but I may just be using the wrong terminology).

I would like to be able to split a DataFrame wherever a certain number of successive 0s appears in a column. Let's say I have this DataFrame:

import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='D')

df = pd.DataFrame({'A': range(6), 'B': [3, 2, 0, 0, 1, 2]}, index=idx)

            A  B
2020-01-01  0  3
2020-01-02  1  2
2020-01-03  2  0
2020-01-04  3  0
2020-01-05  4  1
2020-01-06  5  2

What I'd like to arrive at is two DataFrames that look like this:

            A  B
2020-01-01  0  3
2020-01-02  1  2

            A  B
2020-01-05  4  1
2020-01-06  5  2

I suspect it can be done with groupby and possibly a lambda, but I haven't had any luck so far...

Recommended answer

Here is a not-so-elegant solution, which will nevertheless get you to the groupby you need :)

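# Replace rows that belong to a run of two or more consecutive zeros in B with NaN
df2 = df.mask((df['B'] == 0) & ((df['B'].shift(1) == 0) | (df['B'].shift(-1) == 0)))
# Start a new group at each non-null row whose previous row is null (or missing)
df2['group'] = (df2['B'].shift(1).isnull() & df2['B'].notnull()).cumsum()
# Drop the masked rows and group what is left
df2[df2['B'].notnull()].groupby('group')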

If you inspect df2 (I'm creating a new DataFrame in case you want to keep the original, but you can chain the operations if need be), it now looks like this. Note that A and B have become floats, because the masked rows hold NaN:

            A     B     group
2020-01-01  0.0   3.0   1
2020-01-02  1.0   2.0   1
2020-01-03  NaN   NaN   1
2020-01-04  NaN   NaN   1
2020-01-05  4.0   1.0   2
2020-01-06  5.0   2.0   2

So now you can filter out the rows where df2['B'] is null (essentially the rows that were part of a run of two or more consecutive 0s), and then group by the new column group.
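As a rough end-to-end sketch of the steps above, the following rebuilds the example data and collects each group into its own DataFrame. The names in_run and parts are made up for illustration, and the cast back to the original dtypes is an optional extra that is not part of the answer.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='D')
df = pd.DataFrame({'A': range(6), 'B': [3, 2, 0, 0, 1, 2]}, index=idx)

# Rows whose B is 0 and that have a 0 directly before or after them
in_run = (df['B'] == 0) & ((df['B'].shift(1) == 0) | (df['B'].shift(-1) == 0))

df2 = df.mask(in_run)  # masked rows become NaN in every column
df2['group'] = (df2['B'].shift(1).isnull() & df2['B'].notnull()).cumsum()

# Illustrative extra: one DataFrame per group, without the helper column,
# cast back to the original integer dtypes (masking turned them into floats)
parts = [
    part.drop(columns='group').astype(df.dtypes.to_dict())
    for _, part in df2[df2['B'].notnull()].groupby('group')
]

for part in parts:
    print(part, end='\n\n')
```

Here parts[0] holds the rows for 2020-01-01 and 2020-01-02, and parts[1] the rows for 2020-01-05 and 2020-01-06.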

Here is what happens:

df.mask((df['B'] == 0) & ((df['B'].shift(1) == 0) | (df['B'].shift(-1) == 0)))

If the B value is 0 and either the previous or the next value is also 0, the row is hidden (df.mask() replaces it with NaN).
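To make the condition concrete, here is a small sketch on a different, made-up Series that contains both an isolated zero and a pair of zeros. It suggests that only zeros with a zero neighbour are flagged, so a single 0 on its own is left alone. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: one isolated zero (position 1) and one pair (positions 3 and 4)
b = pd.Series([3, 0, 2, 0, 0, 1])

is_zero = b == 0
prev_is_zero = b.shift(1) == 0    # NaN at the first position compares as False
next_is_zero = b.shift(-1) == 0   # NaN at the last position compares as False

in_run = is_zero & (prev_is_zero | next_is_zero)

print(pd.DataFrame({'b': b, 'is_zero': is_zero, 'in_run': in_run}))
```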

df2['group'] = (df2['B'].shift(1).isnull() & df2['B'].notnull()).cumsum()

Create an indicator column group, just to tell pandas what to group by (you could also group directly by that whole expression; I only want to make the step clear). A new group starts wherever the previous value of B is null and the current value is not null. Taking the cumulative sum of that flag then gives you a fabricated "id" to group by.
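The question asks about "a certain number" of successive 0s, while the mask above is tied to runs of two or more. One possible generalization, sketched below, measures the length of each run of zeros and reuses the same cumulative-sum trick to number the groups. The function split_on_zero_runs and its parameters col and min_run are hypothetical names, not part of the original answer. This variant uses boolean indexing rather than df.mask(), so the integer dtypes are preserved.

```python
import pandas as pd

def split_on_zero_runs(df, col, min_run=2):
    """Hypothetical helper: split df wherever col has at least min_run zeros in a row."""
    is_zero = df[col].eq(0)
    # Number each run of consecutive equal values in is_zero
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    # Length of the run that each row belongs to
    run_len = run_id.map(run_id.value_counts())
    keep = ~(is_zero & (run_len >= min_run))
    # A new group starts at a kept row whose previous row was dropped (or absent)
    starts = keep & ~keep.shift(1, fill_value=False)
    group = starts.cumsum()
    return [part for _, part in df[keep].groupby(group[keep])]

idx = pd.date_range('2020-01-01', periods=6, freq='D')
df = pd.DataFrame({'A': range(6), 'B': [3, 2, 0, 0, 1, 2]}, index=idx)

parts = split_on_zero_runs(df, 'B', min_run=2)
unsplit = split_on_zero_runs(df, 'B', min_run=3)
```

With min_run=2 this should reproduce the two DataFrames from the question. With min_run=3 the run of two zeros is too short to split on, so the data comes back as a single piece.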
