Strange behavior when trying to append a row to each group in a groupby object
Problem description
This question is about a function that behaves unexpectedly when applied to two different dataframes (more precisely, to groupby objects). Either I'm missing something obvious or there's a bug in pandas.
I wrote the function below to append a row to each group in a groupby object. A separate, linked question also concerns this function.
def myfunction(g, now):
    '''This function appends a row to each group and populates the DTM column value of that row with the current timestamp. Other columns of the new row will have NaNs.
    g: a groupby object
    now: current timestamp
    returns a dataframe that has the current timestamp appended in the DTM column for each group
    '''
    g.loc[g.shape[0], 'DTM'] = now  # Appending the current timestamp to a DTM column in each group
    return g
We'll run two tests of the function.
Test 1
It works as expected on the dataframe a from the linked question. Here's a slightly augmented re-run for clarity (mostly copy-pasted from the linked question).
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a
Out[50]:
                     0
first second
bar   one     0.134379
      one     0.967928
      two     0.067502
baz   one     0.182887
      two     0.926932
foo   one     0.806225
      two     0.718322
      two     0.932114
qux   one     0.772494
      two     0.141510
Applying the function:
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
It has appended a new row to each group. A new DTM column was added because it wasn't in the original a. A group is a first-second pair.
a
Out[52]:
                first second         0                      DTM
first second
bar   one    0    bar    one  0.134379                      NaT
             1    bar    one  0.967928                      NaT
             2    NaN    NaN       NaN  2017-07-03 18:56:33.183
      two    2    bar    two  0.067502                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
baz   one    3    baz    one  0.182887                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
      two    4    baz    two  0.926932                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
foo   one    5    foo    one  0.806225                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
      two    6    foo    two  0.718322                      NaT
             7    foo    two  0.932114                      NaT
             2    NaN    NaN       NaN  2017-07-03 18:56:33.183
qux   one    8    qux    one  0.772494                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
      two    9    qux    two  0.141510                      NaT
             1    NaN    NaN       NaN  2017-07-03 18:56:33.183
After some refinement,
a = a.reset_index(level = 2).drop(('level_2', 'first', 'second')).loc[:,(0,'DTM')]
which gives the final a:
a
Out[62]:
                     0                      DTM
first second
bar   one     0.371683                      NaT
      one     0.327870                      NaT
      one          NaN  2017-07-03 18:56:33.183
      two     0.048794                      NaT
      two          NaN  2017-07-03 18:56:33.183
baz   one     0.462747                      NaT
      one          NaN  2017-07-03 18:56:33.183
      two     0.758674                      NaT
      two          NaN  2017-07-03 18:56:33.183
foo   one     0.238607                      NaT
      one          NaN  2017-07-03 18:56:33.183
      two     0.156104                      NaT
      two     0.594270                      NaT
      two          NaN  2017-07-03 18:56:33.183
qux   one     0.091088                      NaT
      one          NaN  2017-07-03 18:56:33.183
      two     0.795864                      NaT
      two          NaN  2017-07-03 18:56:33.183
So far so good. This is the expected behavior. A new row has been appended to each first-second pair, and the DTM column of that row has been populated with the current timestamp.
Test 2
Surprisingly, I couldn't reproduce that behavior on the dataframe df below. A group is an ID-SEQ combination.
This df can be reproduced by:
1. Copying the data below to the clipboard:
C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
C1  587  5/9/2017 10:20  PE
C1  587  5/9/2017 12:25  OK
C1  590  5/9/2017 10:21  PE
C1  590  5/9/2017 13:09  OK
C1  604  5/9/2017 10:38  PE
C1  604  5/9/2017 12:32  OK
C1  609  5/9/2017 10:39  PE
C1  609  5/9/2017 13:29  OK
C1  613  5/9/2017 10:39  PE
C1  613  5/9/2017 13:08  OK
C1  618  5/9/2017 10:40  PE
C1  618  5/9/2017 13:33  OK
C1  636  5/9/2017 10:54  PE
C1  636  5/9/2017 13:36  OK
C1  642  5/9/2017 10:55  PE
C1  642  5/9/2017 13:35  OK
C1  643  5/9/2017 10:56  PE
C1  643  5/9/2017 13:34  OK
C1  656  5/9/2017 10:55  PE
C1  656  5/9/2017 13:36  OK
C2  86  9/5/2016 19:45  PE
C2  86  9/6/2016 11:55  OK
C3  10  4/17/2017 12:23  PE
C3  10  4/17/2017 14:51  OK
C4  38  3/25/2017 10:35  PE
C4  38  3/25/2017 10:51  OK
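2. Then executing the following:
df = pd.read_clipboard(sep='[ ]{2,}', header=None)  # fields are separated by 2+ spaces; the copied data has no header row
df.columns = ['ID', 'SEQ', 'DTM', 'STATUS']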
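For readers who cannot use the clipboard, here is a minimal sketch of the same parsing step from an in-memory string. Only the first four rows are included, and the name `raw` is chosen for this illustration. It assumes fields are separated by at least two spaces, so the single space inside each DTM value is kept intact.

```python
import io

import pandas as pd

# First four rows of the sample data; fields are separated by two spaces.
raw = """\
C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
"""

# A regex separator requires the python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=r'[ ]{2,}', engine='python',
                 header=None, names=['ID', 'SEQ', 'DTM', 'STATUS'])
print(df)
```

The DTM values are still plain strings at this point, as in the question.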
Setting a multi-index:
d = df.set_index(['ID', 'SEQ', 'DTM']) # I have three index levels this time in the original dataframe
d looks like this:
d
Out[40]:
                       STATUS
ID SEQ DTM
C1 572 5/9/2017 10:13      PE
       5/9/2017 12:24      OK
   579 5/9/2017 10:19      PE
       5/9/2017 13:25      OK
   587 5/9/2017 10:20      PE
       5/9/2017 12:25      OK
   590 5/9/2017 10:21      PE
       5/9/2017 13:09      OK
   604 5/9/2017 10:38      PE
       5/9/2017 12:32      OK
   609 5/9/2017 10:39      PE
       5/9/2017 13:29      OK
   613 5/9/2017 10:39      PE
       5/9/2017 13:08      OK
   618 5/9/2017 10:40      PE
       5/9/2017 13:33      OK
   636 5/9/2017 10:54      PE
       5/9/2017 13:36      OK
   642 5/9/2017 10:55      PE
       5/9/2017 13:35      OK
   643 5/9/2017 10:56      PE
       5/9/2017 13:34      OK
   656 5/9/2017 10:55      PE
       5/9/2017 13:36      OK
C2 86  9/5/2016 19:45      PE
       9/6/2016 11:55      OK
C3 10  4/17/2017 12:23     PE
       4/17/2017 14:51     OK
C4 38  3/25/2017 10:35     PE
       3/25/2017 10:51     OK
Applying the function:
dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction(x, now)) # a group is a unique combination of ID-SEQ pairs
This returns the following (notice the fourth row):
dd
Out[37]:
            ID    SEQ                         DTM  STATUS
ID SEQ
C1 572 0    C1  572.0              5/9/2017 10:13      PE
       1    C1  572.0              5/9/2017 12:24      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   579 2    C1  579.0  2017-07-03 18:56:33.183000      PE
       3    C1  579.0              5/9/2017 13:25      OK
   587 4    C1  587.0              5/9/2017 10:20      PE
       5    C1  587.0              5/9/2017 12:25      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   590 6    C1  590.0              5/9/2017 10:21      PE
       7    C1  590.0              5/9/2017 13:09      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   604 8    C1  604.0              5/9/2017 10:38      PE
       9    C1  604.0              5/9/2017 12:32      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   609 10   C1  609.0              5/9/2017 10:39      PE
       11   C1  609.0              5/9/2017 13:29      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   613 12   C1  613.0              5/9/2017 10:39      PE
       13   C1  613.0              5/9/2017 13:08      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   618 14   C1  618.0              5/9/2017 10:40      PE
       15   C1  618.0              5/9/2017 13:33      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   636 16   C1  636.0              5/9/2017 10:54      PE
       17   C1  636.0              5/9/2017 13:36      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   642 18   C1  642.0              5/9/2017 10:55      PE
       19   C1  642.0              5/9/2017 13:35      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   643 20   C1  643.0              5/9/2017 10:56      PE
       21   C1  643.0              5/9/2017 13:34      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   656 22   C1  656.0              5/9/2017 10:55      PE
       23   C1  656.0              5/9/2017 13:36      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C2 86  24   C2   86.0              9/5/2016 19:45      PE
       25   C2   86.0              9/6/2016 11:55      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C3 10  26   C3   10.0             4/17/2017 12:23      PE
       27   C3   10.0             4/17/2017 14:51      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C4 38  28   C4   38.0             3/25/2017 10:35      PE
       29   C4   38.0             3/25/2017 10:51      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
After some refinement,
ddd = dd.reset_index(level = 2).drop(('level_2', 'ID', 'SEQ')).loc[:,('DTM','STATUS')]
ddd
Out[39]:
                               DTM  STATUS
ID SEQ
C1 572              5/9/2017 10:13      PE
   572              5/9/2017 12:24      OK
   572  2017-07-03 18:56:33.183000     NaN
   579  2017-07-03 18:56:33.183000      PE
   579              5/9/2017 13:25      OK
   587              5/9/2017 10:20      PE
   587              5/9/2017 12:25      OK
   587  2017-07-03 18:56:33.183000     NaN
   590              5/9/2017 10:21      PE
   590              5/9/2017 13:09      OK
   590  2017-07-03 18:56:33.183000     NaN
   604              5/9/2017 10:38      PE
   604              5/9/2017 12:32      OK
   604  2017-07-03 18:56:33.183000     NaN
   609              5/9/2017 10:39      PE
   609              5/9/2017 13:29      OK
   609  2017-07-03 18:56:33.183000     NaN
   613              5/9/2017 10:39      PE
   613              5/9/2017 13:08      OK
   613  2017-07-03 18:56:33.183000     NaN
   618              5/9/2017 10:40      PE
   618              5/9/2017 13:33      OK
   618  2017-07-03 18:56:33.183000     NaN
   636              5/9/2017 10:54      PE
   636              5/9/2017 13:36      OK
   636  2017-07-03 18:56:33.183000     NaN
   642              5/9/2017 10:55      PE
   642              5/9/2017 13:35      OK
   642  2017-07-03 18:56:33.183000     NaN
   643              5/9/2017 10:56      PE
   643              5/9/2017 13:34      OK
   643  2017-07-03 18:56:33.183000     NaN
   656              5/9/2017 10:55      PE
   656              5/9/2017 13:36      OK
   656  2017-07-03 18:56:33.183000     NaN
C2 86               9/5/2016 19:45      PE
   86               9/6/2016 11:55      OK
   86   2017-07-03 18:56:33.183000     NaN
C3 10              4/17/2017 12:23      PE
   10              4/17/2017 14:51      OK
   10   2017-07-03 18:56:33.183000     NaN
C4 38              3/25/2017 10:35      PE
   38              3/25/2017 10:51      OK
   38   2017-07-03 18:56:33.183000     NaN
Issue
A new row containing the current timestamp has been appended to every ID-SEQ group except the C1-579 group! (See the 4th row in dd and ddd.)
Questions
- What caused this problem?
- What is the additional index level introduced in dd?
Recommended answer
After a lot of debugging, I found the problem.
The problem is a duplicate label in the third index level. In your last sample, the group has a shape of 2, so the function writes to label 2. That label already exists in the group, so no new row is added; the existing row is overwritten instead.
            ID    SEQ                         DTM  STATUS
ID SEQ
C1 572 0    C1  572.0  2017-05-09 10:13:00.000000      PE
       1    C1  572.0  2017-05-09 12:24:00.000000      OK
       2   NaN    NaN  2017-07-06 08:46:02.341472     NaN
   579 2    C1  579.0  2017-07-06 08:46:02.341472      PE  <- overwritten values in row
       3    C1  579.0  2017-05-09 13:25:00.000000      OK
   587 4    C1  587.0  2017-05-09 10:20:00.000000      PE
       5    C1  587.0  2017-05-09 12:25:00.000000      OK
       2   NaN    NaN  2017-07-06 08:46:02.341472     NaN
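The overwrite follows from how label-based assignment with .loc works: assigning to an existing label updates that row, while assigning to a missing label enlarges the frame. A minimal sketch, using a hand-built two-row frame as a stand-in for the C1-579 group (the frame and the string 'now' are illustrative, not taken from the original data):

```python
import pandas as pd

# Stand-in for one group: it keeps labels 2 and 3 from the parent frame.
g = pd.DataFrame({'DTM': ['5/9/2017 10:19', '5/9/2017 13:25'],
                  'STATUS': ['PE', 'OK']},
                 index=[2, 3])

# len(g) == 2 and label 2 already exists, so this overwrites row 2.
overwritten = g.copy()
overwritten.loc[len(overwritten), 'DTM'] = 'now'

# Label 4 does not exist, so this adds a new row (STATUS becomes NaN).
appended = g.copy()
appended.loc[appended.index.max() + 1, 'DTM'] = 'now'

print(overwritten)
print(appended)
```

The first frame still has two rows, with the DTM of the PE row replaced; the second has three.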
The first sample worked because its second group has only one row.
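Whether a group is affected depends on whether its row count happens to equal one of the row labels it inherited from the parent frame. Those inherited labels also appear to be what shows up as the additional index level in dd. A small sketch on a cut-down, four-row stand-in for df (the `collisions` dictionary is a name made up for this illustration):

```python
import pandas as pd

# Cut-down stand-in for df after reset_index(): row labels are 0..3.
df = pd.DataFrame({'ID': ['C1'] * 4,
                   'SEQ': [572, 572, 579, 579],
                   'STATUS': ['PE', 'OK', 'PE', 'OK']})

collisions = {}
for seq, g in df.groupby('SEQ'):
    labels = g.index.tolist()                   # labels kept from the parent frame
    collisions[seq] = g.shape[0] in labels      # would g.loc[g.shape[0]] hit an existing row?
    print(seq, labels, g.shape[0], collisions[seq])
```

Only the 579 group reports a collision: it has two rows, and one of them is already labelled 2.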
But if it has two rows:
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
print(a)
                     0
first second
bar   one     0.366258
      two     0.583205
      two     0.159388
baz   one     0.598198
      two     0.274027
foo   one     0.086461
      two     0.353577
      two     0.823377
qux   one     0.098737
      two     0.128470
The same problem appears:
print (a)
                first second         0                         DTM
first second
bar   one    0    bar    one  0.366258                         NaT
             1    NaN    NaN       NaN  2017-07-06 08:47:55.610671
      two    1    bar    two  0.583205                         NaT
             2    bar    two  0.159388  2017-07-06 08:47:55.610671  <- overwritten
baz   one    3    baz    one  0.598198                         NaT
             1    NaN    NaN       NaN  2017-07-06 08:47:55.610671
      two    4    baz    two  0.274027                         NaT
So if the function is changed slightly, everything works as expected:
import numpy as np
import pandas as pd

now = pd.Timestamp.now()  # replaces pd.datetime.now(), which recent pandas versions no longer provide

def myfunction(g, now):
    # A string label such as '2a' cannot collide with the integer labels already in the group
    g.loc[str(g.shape[0]) + 'a', 'DTM'] = now
    return g

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
print(a)
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
print(a)
                 first second         0                         DTM
first second
bar   one    0     bar    one  0.677641                         NaT
             1a    NaN    NaN       NaN  2017-07-06 08:54:47.481671
      two    1     bar    two  0.274588                         NaT
             2     bar    two  0.524903                         NaT
             2a    NaN    NaN       NaN  2017-07-06 08:54:47.481671
baz   one    3     baz    one  0.198272                         NaT
             1a    NaN    NaN       NaN  2017-07-06 08:54:47.481671
      two    4     baz    two  0.787949                         NaT
             1a    NaN    NaN       NaN  2017-07-06 08:54:47.481671
foo   one    5     foo    one  0.484197                         NaT
             1a    NaN    NaN       NaN  2017-07-06 08:54:47.481671
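As an alternative to inventing a non-colliding label, one could avoid label-based assignment altogether and concatenate a one-row frame onto each group. This is not the approach used in the answer above, only a sketch of another option; `append_timestamp_row` is a hypothetical helper name, the four-row frame is a stand-in for df, and the groups are iterated directly rather than passed through apply.

```python
import pandas as pd

def append_timestamp_row(g, now):
    """Return a copy of group g with one extra row whose DTM is `now`."""
    new_row = pd.DataFrame({'DTM': [now]})
    # ignore_index renumbers the rows, so existing labels can never collide.
    return pd.concat([g, new_row], ignore_index=True)

# Cut-down stand-in for df, with DTM already parsed as datetimes.
df = pd.DataFrame({
    'ID': ['C1', 'C1', 'C1', 'C1'],
    'SEQ': [572, 572, 579, 579],
    'DTM': pd.to_datetime(['2017-05-09 10:13', '2017-05-09 12:24',
                           '2017-05-09 10:19', '2017-05-09 13:25']),
    'STATUS': ['PE', 'OK', 'PE', 'OK'],
})
now = pd.Timestamp('2017-07-03 18:56:33')

pieces = [append_timestamp_row(g, now) for _, g in df.groupby(['ID', 'SEQ'])]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

As in the original output, the appended rows have NaN in ID, SEQ and STATUS, so those would need to be filled in afterwards if the group keys are required on the new rows.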