Check on pandas dataframe

Problem description

I have a pandas dataframe composed of 3 columns.

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54500   0
    7   54501   54501   -1
    8   54502   55259   0

I would like to implement a check on the difference between start and end of each row. In particular, what I would like to do is:

if end row x - start row x  == 0 incorporate this row in the previous row
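The condition above can be expressed as a boolean mask. The following is only a small sketch on a few rows taken from the table; the variable names `zero_len` and `zero_len_positions` are illustrative and not part of the original question.

```python
import pandas as pd

# A few rows from the example table (default 0..n-1 index).
df = pd.DataFrame({
    "start": [51353, 51666, 54501, 54502],
    "end":   [51665, 54500, 54501, 55259],
    "value": [-1, 0, -1, 0],
})

# True where end - start == 0, i.e. the row covers a single position.
zero_len = (df["end"] - df["start"]) == 0

# Integer positions of the rows that should be merged into the previous row.
zero_len_positions = zero_len.to_numpy().nonzero()[0].tolist()
print(zero_len_positions)  # [2]
```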

For example, the 8th row

7   54501   54501   -1

has end - start = 0. I would like to modify the dataframe like this

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54501   0
    7   54502   55259   0

and then, since the 7th and the 8th rows now have the same "value", it should become

    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   55259   0
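One possible way to perform this second merge (neighbouring rows with the same "value") is to label each run of equal values and aggregate per run. This is a sketch under the assumption that rows are already sorted by start; `run_id` and `merged` are names chosen for illustration.

```python
import pandas as pd

# The last rows of the table after the zero-length row has been absorbed.
df = pd.DataFrame({
    "start": [51353, 51666, 54502],
    "end":   [51665, 54501, 55259],
    "value": [-1, 0, 0],
})

# A new run starts wherever value differs from the previous row;
# the cumulative sum turns those change points into run labels.
run_id = (df["value"] != df["value"].shift()).cumsum().rename("run")

# Each run keeps its first start, its last end, and its shared value.
merged = (
    df.groupby(run_id)
      .agg(start=("start", "first"), end=("end", "last"), value=("value", "first"))
      .reset_index(drop=True)
)
print(merged)
```

Because a whole run is aggregated at once, this also covers runs of three or more equal-value rows.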

EDITED

Please note that a particular case would be

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54500   0
    7   54501   54501   -1
    8   54502   54502   0
    9   54503   55259   1

In this case there are 2 consecutive rows (8th and 9th) for which the difference between end and start values is 0. Here the proposed answer gives an error, since the row at index 7 was already deleted. I solved this case using a while loop instead of a for loop, but I guess it is not the best thing to do.

For this case we should have

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54502   0
    7   54503   55259   1


Recommended answer

Using numpy's where, you can do it like this:

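import numpy as np

# Assumes df has the default 0..n-1 integer index, so positions equal labels.

# Step 1: absorb zero-length rows (start == end) into the preceding row.
inp = np.where(df.start == df.end)[0]
droplist = []
target = None  # label of the row that absorbs the current run
for i in range(len(inp)):
    if inp[i] > 0:  # the first row has no previous row to merge into
        # Start a new run unless this row directly follows another absorbed row.
        if target is None or inp[i] - inp[i-1] != 1:
            target = inp[i] - 1
        df.loc[target, "end"] = df.loc[inp[i], "end"]
        droplist.append(inp[i])
df = df.drop(droplist).reset_index(drop=True)

# Step 2: merge neighbouring rows that now share the same value.
droplist = []
jnp = np.where(df.value == df.value.shift(-1))[0]
# Walk backwards so that runs of three or more rows propagate the final end.
for jj in jnp[::-1]:
    df.loc[jj, "end"] = df.loc[jj+1, "end"]
    droplist.append(jj+1)
df = df.drop(droplist).reset_index(drop=True)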

There might be a more pythonic way without for-loops using numpy, though.
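As a hedged sketch of such a loop-free variant, both steps can be written with `cumsum` labels and `groupby`. The function name `collapse_rows` is hypothetical, and the code assumes the rows are sorted by start and that a zero-length first row (which has no predecessor) should simply be kept.

```python
import numpy as np
import pandas as pd

def collapse_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Absorb zero-length rows into the previous row, then merge equal-value neighbours."""
    # Step 1: a row is kept if start != end; the first row is always kept.
    keep = (df["start"] != df["end"]).to_numpy(copy=True)
    if len(keep):
        keep[0] = True
    # Every kept row opens a block; following zero-length rows join that block.
    block = np.cumsum(keep)
    step1 = df.groupby(block).agg(
        start=("start", "first"), end=("end", "last"), value=("value", "first")
    )

    # Step 2: merge neighbouring rows that share the same value.
    run = (step1["value"] != step1["value"].shift()).cumsum().rename("run")
    return (
        step1.groupby(run)
             .agg(start=("start", "first"), end=("end", "last"), value=("value", "first"))
             .reset_index(drop=True)
    )

# The "particular case" from the question: two consecutive zero-length rows.
df = pd.DataFrame({
    "start": [0, 37648, 37847, 42875, 43050, 51353, 51666, 54501, 54502, 54503],
    "end":   [37647, 37846, 42874, 43049, 51352, 51665, 54500, 54501, 54502, 55259],
    "value": [0, 1, 0, 1, 0, -1, 0, -1, 0, 1],
})
result = collapse_rows(df)
print(result)
```

On this input the row starting at 51666 ends at 54502 and the last row runs from 54503 to 55259, matching the expected table in the question.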

Edit: fixed for consecutive rows.
