pandas replace only part of a column


Question

This is my input:

import pandas as pd
import numpy as np

list1 = [10,79,6,38,4,557,12,220,46,22,45,22]
list2 = [4,3,23,6,234,47,312,2,426,42,435,23]

df = pd.DataFrame({'A' : list1, 'B' : list2}, columns = ['A', 'B'])
df['C'] = np.where(df['A'] > df['B'].shift(-2), 1, np.nan)
print(df)

That produces this output:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

What I need to do is to change column 'C' to be a set of three 1's in a row, non-overlapping. The desired output is:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

So, rows 2, 3, and 6 change from NaN to 1.0. Row 7 already has a 1.0 and it is ignored. Rows 8 and 9 need to stay NaN because row 7 is the last entry of the previous set.

I don't know if there is a better way to build column 'C' that would do this at creation.

I have tried several versions of fillna and ffill, none of them worked for me.

It seems very convoluted but I tried to isolate the row id's for each 1.0 with this line:

print (df.loc[df['C'] == 1])

Which correctly outputs this:

     A   B    C
1   79   3  1.0
5  557  47  1.0
7  220   2  1.0

Even though I know that information, I don't know how to proceed from there.

Thank you so much for your help in advance, David

Recommended answer

Faster version (thanks to b2002):

ii = df[pd.notnull(df.C)].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.C.values[ci:ci+3] = 1.0
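
A note on the slice bounds, as a small sketch: a positional slice such as `.values[ci:ci+3]` (or `.iloc[ci:ci+3]`) excludes its end, while a label slice such as `.loc[ci:ci+2]` includes it. The two pick the same three rows only when the index is the default 0..n-1 range, as it is in the question. The toy Series `s` below is made up for illustration. Separately, writing through `.values` assumes the array is a writable view of the column; as far as I know, with pandas Copy-on-Write enabled it is read-only, in which case the `.loc` form is the one to use.

```python
import numpy as np
import pandas as pd

# Toy column with a default RangeIndex (0..5); not the asker's data.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, np.nan])

ci = 1
by_position = s.iloc[ci:ci + 3]  # positions 1, 2, 3 (end excluded)
by_label = s.loc[ci:ci + 2]      # labels 1, 2, 3 (end included)

print(list(by_position.index))
print(list(by_label.index))
```

With a non-default index (dates, strings, or integers that do not start at 0), the positional and label forms would no longer address the same rows, so it is worth choosing one deliberately.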


First get the indices of the starting points: the first non-null entry in column C, plus every later non-null entry that is more than two rows after the previous non-null entry. These come from the differences between consecutive non-null indices. Then iterate over those indices and use loc to change slices of column C:

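# Index labels of the rows where C is already set: [1, 5, 7] for the sample data
ii = df[pd.notnull(df.C)].index
# Gaps between consecutive labels: [4, 2]
dd = np.diff(ii)
# Keep a label as a starting point only if it is more than 2 rows after the previous one
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
# The first label always starts a set, so jj becomes [1, 5]
jj = [ii[0]] + jj

for ci in jj:
    # Set rows ci, ci+1 and ci+2
    df.loc[ci:ci+2,'C'] = 1.0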

Result:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
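
One caveat with the gap test above: it only starts a new set when a 1.0 is more than two rows after the previous 1.0. With 1.0 at rows 1, 3 and 5 it would fill rows 1 to 3 and never start a set at row 5, even though row 5 lies outside the first set. If that case matters, a plain left-to-right scan is one way to handle it. This is a sketch, not part of the original answer: the function name `fill_runs` and its `width` parameter are made up for illustration, and it assumes each set should start at the first 1.0 not covered by the previous set.

```python
import numpy as np
import pandas as pd

def fill_runs(flags, width=3):
    """Return a copy of `flags` where each non-null entry that is not
    already covered by an earlier run starts a run of `width` ones."""
    marked = flags.notna().to_numpy()
    out = np.full(len(flags), np.nan)
    next_free = 0  # first position not covered by the previous run
    for pos in np.flatnonzero(marked):
        if pos >= next_free:
            out[pos:pos + width] = 1.0  # the slice is clipped at the end of the array
            next_free = pos + width
    return pd.Series(out, index=flags.index, name=flags.name)

list1 = [10, 79, 6, 38, 4, 557, 12, 220, 46, 22, 45, 22]
list2 = [4, 3, 23, 6, 234, 47, 312, 2, 426, 42, 435, 23]
df = pd.DataFrame({'A': list1, 'B': list2})
df['C'] = np.where(df['A'] > df['B'].shift(-2), 1, np.nan)
df['C'] = fill_runs(df['C'])
print(df)
```

For the question's data this gives the same column as the result above. The scan works by position, so it does not depend on the index being 0..n-1.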
