Check on pandas dataframe
Question
I have a pandas dataframe composed of 3 columns.
index start end value
0 0 37647 0
1 37648 37846 1
2 37847 42874 0
3 42875 43049 1
4 43050 51352 0
5 51353 51665 -1
6 51666 54500 0
7 54501 54501 -1
8 54502 55259 0
I would like to implement a check on the difference between start and end of each row. In particular, what I would like to do is:
if end - start == 0 for row x, merge that row into the previous row
For example, the 8th row
7 54501 54501 -1
has end - start = 0. I would like to modify the dataframe like this:
index start end value
0 0 37647 0
1 37648 37846 1
2 37847 42874 0
3 42875 43049 1
4 43050 51352 0
5 51353 51665 -1
6 51666 54501 0
7 54502 55259 0
and then, since the 7th and the 8th rows now have the same "value", it should become
0 0 37647 0
1 37648 37846 1
2 37847 42874 0
3 42875 43049 1
4 43050 51352 0
5 51353 51665 -1
6 51666 55259 0
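The second step (collapsing neighbouring rows that share the same "value") can be sketched without a loop. This is only an illustration, not code from the thread: it assumes the rows are already sorted by start, and the names run_id and merged are made up for this example.

```python
import pandas as pd

# Intermediate frame from the question, after the zero-length row was absorbed.
df = pd.DataFrame({
    "start": [0, 37648, 37847, 42875, 43050, 51353, 51666, 54502],
    "end":   [37647, 37846, 42874, 43049, 51352, 51665, 54501, 55259],
    "value": [0, 1, 0, 1, 0, -1, 0, 0],
})

# A new run begins wherever "value" differs from the row above it;
# the cumulative sum of those change points gives every run its own label.
run_id = (df["value"] != df["value"].shift()).cumsum().rename("run_id")

# Per run: keep the first start, the last end, and the shared value.
merged = (
    df.groupby(run_id)
      .agg(start=("start", "first"), end=("end", "last"), value=("value", "first"))
      .reset_index(drop=True)
)
print(merged)
```

Because whole runs are grouped at once, three or more neighbouring rows with the same value would also collapse into a single row.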
EDITED
Please note that a particular case would be
index start end value
0 0 37647 0
1 37648 37846 1
2 37847 42874 0
3 42875 43049 1
4 43050 51352 0
5 51353 51665 -1
6 51666 54500 0
7 54501 54501 -1
8 54502 54502 0
9 54503 55259 1
In this case there are 2 consecutive rows (the 8th and the 9th) for which the difference between the end and start values is 0. Here the proposed answer gives an error, since index 7 was already deleted. I solved this case using a while loop instead of a for loop, but I guess it is not the best thing to do.
For this case we should have
index start end value
0 0 37647 0
1 37648 37846 1
2 37847 42874 0
3 42875 43049 1
4 43050 51352 0
5 51353 51665 -1
6 51666 54502 0
7 54503 55259 1
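One possible way to handle consecutive zero-length rows without deleting rows inside a loop is to label each "real" row together with the zero-length rows that follow it, then take the last end of each block. The sketch below is an assumption-laden illustration, not the code from the thread: keep, block_id, block_end and absorbed are hypothetical names, and a zero-length first row is simply kept because there is no previous row to merge it into.

```python
import pandas as pd

# The particular case from the question: rows 7 and 8 both have start == end.
df = pd.DataFrame({
    "start": [0, 37648, 37847, 42875, 43050, 51353, 51666, 54501, 54502, 54503],
    "end":   [37647, 37846, 42874, 43049, 51352, 51665, 54500, 54501, 54502, 55259],
    "value": [0, 1, 0, 1, 0, -1, 0, -1, 0, 1],
})

# True for rows that survive, False for zero-length rows to be absorbed.
keep = df["start"] != df["end"]
keep.iloc[0] = True  # the first row has no predecessor, so it is always kept

# Every kept row opens a new block; absorbed rows inherit the block of the
# kept row above them.
block_id = keep.cumsum()

# The end of the last row in each block becomes the end of the kept row.
block_end = df.groupby(block_id)["end"].transform("last")
absorbed = df.loc[keep].assign(end=block_end[keep]).reset_index(drop=True)
print(absorbed)
```

Nothing is dropped while iterating, so the positions never go stale, which is the problem described above for the index-based loop.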
Recommended answer
Using numpy.where, you can do it like this:
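import numpy as np

# df is the dataframe from the question, with its default 0..n-1 index.
# Positions of the zero-length rows (start == end).
inp = np.where(df.start == df.end)[0]
droplist = []
j = 0  # how many zero-length rows sit directly before the current one
for i in range(len(inp)):
    if inp[i] > 0:
        if i > 0 and inp[i]-inp[i-1] == 1:
            # Consecutive zero-length row: extend the same earlier row again.
            j += 1
            df.loc[inp[i]-1-j,"end"] += 1
        else:
            j = 0
            df.loc[inp[i]-1,"end"] += 1
        droplist.append(inp[i])
df = df.drop(droplist).reset_index(drop=True)

# Merge neighbouring rows that now share the same value.
droplist = []
jnp = np.where(df.value == df.value.shift(-1))[0]
# Walk backwards so that runs of three or more equal values collapse correctly.
for jj in jnp[::-1]:
    df.loc[jj,"end"] = df.loc[jj+1,"end"]
    droplist.append(jj+1)
df = df.drop(droplist).reset_index(drop=True)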
There might be a more pythonic way to do this with numpy, without for-loops.
Edit: fixed for consecutive rows.
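A loop-free variant along the lines the answer hints at might look like the sketch below. It is not the answerer's code: collapse_rows is a hypothetical helper name, the rows are assumed to be sorted by start, and a zero-length first row is kept as-is because it has no previous row.

```python
import numpy as np
import pandas as pd


def collapse_rows(df):
    """Absorb zero-length rows into the previous row, then merge equal neighbours."""
    n = len(df)
    if n == 0:
        return df.copy()
    start = df["start"].to_numpy()
    end = df["end"].to_numpy()
    value = df["value"].to_numpy()

    # Step 1: rows with start == end are absorbed by the closest kept row above.
    keep = start != end
    keep[0] = True  # nothing above the first row to merge it into
    kept_idx = np.flatnonzero(keep)
    # Last row of each block = the row just before the next kept row.
    block_last = np.append(kept_idx[1:], n) - 1
    s, e, v = start[kept_idx], end[block_last], value[kept_idx]

    # Step 2: collapse runs of neighbouring rows that share the same value.
    is_run_start = np.concatenate(([True], v[1:] != v[:-1]))
    run_first = np.flatnonzero(is_run_start)
    run_last = np.append(run_first[1:], len(v)) - 1
    return pd.DataFrame(
        {"start": s[run_first], "end": e[run_last], "value": v[run_first]}
    )


# First example from the question.
example = pd.DataFrame({
    "start": [0, 37648, 37847, 42875, 43050, 51353, 51666, 54501, 54502],
    "end":   [37647, 37846, 42874, 43049, 51352, 51665, 54500, 54501, 55259],
    "value": [0, 1, 0, 1, 0, -1, 0, -1, 0],
})
result = collapse_rows(example)
print(result)
```

Taking the end of the last absorbed row, rather than adding 1 per absorbed row, means the sketch does not rely on each start being exactly the previous end plus one.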