Pandas Python: Merging every two rows in one dataframe
Question
How do I get from
Idx A B C
2004-04-01 1 1 0
2004-04-02 1 1 0
2004-05-01 0 0 0
2004-05-02 0 0 0
to
Idx A B C
2004-04 2 2 0
2004-05 0 0 0
Note: How do I collapse both the index (more specifically, convert the index to just the month) and every two rows?
Is rolling the best way to do this?
UPDATE - I simplified the version above, but unutbu's answer does not seem to work on my actual data:
Time A B
1 2004-01-04 - 2004-01-10 0 0
2 2004-01-11 - 2004-01-17 0 0
3 2004-01-18 - 2004-01-24 0 0
4 2004-01-25 - 2004-01-31 0 0
5 2004-02-01 - 2004-02-07 56 0
6 2004-02-08 - 2004-02-14 67 0
Recommended answer
You can aggregate rows using a groupby/sum operation:
import pandas as pd
import numpy as np
# Plain ints: the Python 2 long literals (1L) are a syntax error in Python 3.
df = pd.DataFrame([('2004-04-01', 1, 1, 0), ('2004-04-02', 1, 1, 0),
                   ('2004-05-01', 0, 0, 0), ('2004-05-02', 0, 0, 0)],
                  columns=['Idx', 'A', 'B', 'C'])
# Parse the date strings into datetime64 values.
df['Idx'] = pd.to_datetime(df['Idx'])
You could group by the year and month:
# Select the numeric columns so the datetime column Idx is not summed.
print(df.groupby([d.strftime('%Y-%m') for d in df['Idx']])[['A', 'B', 'C']].sum())
# A B C
# 2004-04 2 2 0
# 2004-05 0 0 0
# [2 rows x 3 columns]
Or, group by every two rows:
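# Group keys 0, 0, 1, 1 pair up consecutive rows; sum only the numeric columns.
result = df.groupby(np.arange(len(df)) // 2)[['A', 'B', 'C']].sum()
# Label each pair with the date of its second row.
result.index = df.loc[1::2, 'Idx']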
print(result)
# A B C
# Idx
# 2004-04-02 2 2 0
# 2004-05-02 0 0 0
# [2 rows x 3 columns]
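To see why integer division produces the pairing, it can help to look at the group keys on their own. The following is a minimal sketch; the toy frame and the names `keys` and `pairs` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with four rows, mirroring the example above.
df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [1, 1, 0, 0], 'C': [0, 0, 0, 0]})

# Row positions 0, 1, 2, 3 floor-divided by 2 give one key per pair of rows.
keys = np.arange(len(df)) // 2
print(keys)  # [0 0 1 1]

# Rows that share a key are summed together.
pairs = df.groupby(keys).sum()
print(pairs)
```

Replacing the divisor 2 with n would group every n rows; if len(df) is not a multiple of n, the last group simply holds fewer rows.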
Note: df.loc[1::2, 'Idx'] was used instead of df.loc[::2, 'Idx'], so the Idx for each aggregated row corresponds to the second date in its group, not the first.
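The difference between the two slices can be checked directly. This is a small sketch that assumes the default integer index (0, 1, 2, 3); the names `first_dates` and `second_dates` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Idx': pd.to_datetime(['2004-04-01', '2004-04-02',
                                          '2004-05-01', '2004-05-02'])})

# 1::2 picks labels 1 and 3, i.e. the second row of each pair.
second_dates = df.loc[1::2, 'Idx']
# ::2 picks labels 0 and 2, i.e. the first row of each pair.
first_dates = df.loc[::2, 'Idx']

print(second_dates.dt.strftime('%Y-%m-%d').tolist())  # ['2004-04-02', '2004-05-02']
print(first_dates.dt.strftime('%Y-%m-%d').tolist())   # ['2004-04-01', '2004-05-01']
```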
If you want just the year and month, then you could use this list comprehension to set the index:
result.index = [d.strftime('%Y-%m') for d in df.loc[1::2, 'Idx']]
print(result)
# A B C
# 2004-04 2 2 0
# 2004-05 0 0 0
# [2 rows x 3 columns]
However, when dealing with dates, a DatetimeIndex is more powerful than a string-valued index. So you might want to retain the DatetimeIndex, do most of your work with it, and convert to a year-month string only at the end, for presentation purposes.
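As one possible illustration of that workflow, the sketch below keeps timestamps in the index, aggregates by calendar month with resample (a different technique from the groupby calls above, swapped in here as an assumption about what "working with the DatetimeIndex" might look like), and formats the labels only at the end. The names `monthly` and `presentation` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Idx': pd.to_datetime(['2004-04-01', '2004-04-02',
                                          '2004-05-01', '2004-05-02']),
                   'A': [1, 1, 0, 0], 'B': [1, 1, 0, 0], 'C': [0, 0, 0, 0]})

# Keep real timestamps in the index and aggregate by calendar month.
# 'MS' labels each month by its first day.
monthly = df.set_index('Idx').resample('MS').sum()

# Convert to year-month strings only when presenting the result.
presentation = monthly.copy()
presentation.index = monthly.index.strftime('%Y-%m')
print(presentation)
```

Because `monthly` still has a DatetimeIndex, date-based slicing and further resampling remain available on it; only `presentation` carries string labels.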
Regarding the updated question:
import pandas as pd
import numpy as np
# Plain ints replace the Python 2 long literals (0L, 56L, ...).
data = np.rec.array([('2004-01-04 - 2004-01-10', 0, 0),
                     ('2004-01-11 - 2004-01-17', 0, 0),
                     ('2004-01-18 - 2004-01-24', 0, 0),
                     ('2004-01-25 - 2004-01-31', 0, 0),
                     ('2004-02-01 - 2004-02-07', 56, 0),
                     ('2004-02-08 - 2004-02-14', 67, 0)],
                    dtype=[('Time', 'O'), ('A', '<i8'), ('B', '<i8')])
df = pd.DataFrame(data)
Having one Time column holding two dates makes data manipulation more difficult. It would be better to have two datetime columns, Start and End:
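# The named groups in the pattern become the Start and End columns.
df[['Start', 'End']] = df['Time'].str.extract(r'(?P<Start>.+) - (?P<End>.+)', expand=True)
del df['Time']
# Parse both columns into datetime64 values.
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])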
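The splitting step can also be inspected in isolation. The sketch below runs the same pattern on a two-element Series to show what str.extract returns; the names `time` and `parts` are illustrative, and the pattern assumes each value contains exactly one " - " separator.

```python
import pandas as pd

time = pd.Series(['2004-01-04 - 2004-01-10', '2004-02-01 - 2004-02-07'],
                 name='Time')

# With expand=True, each named group becomes a column of the result.
parts = time.str.extract(r'(?P<Start>.+) - (?P<End>.+)', expand=True)

# Convert both string columns to datetime64.
parts = parts.apply(pd.to_datetime)
print(parts.dtypes)
```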
Then you could group by the Start column:
# Select the numeric columns so the datetime columns Start and End are not summed.
print(df.groupby([d.strftime('%Y-%m') for d in df['Start']])[['A', 'B']].sum())
# A B
# 2004-01 0 0
# 2004-02 123 0
# [2 rows x 2 columns]
Or group by every two rows, essentially the same as before:
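# Pair up consecutive rows and sum only the numeric columns.
result = df.groupby(np.arange(len(df)) // 2)[['A', 'B']].sum()
# Label each pair with the Start date of its second row.
result.index = df.loc[1::2, 'Start']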
print(result)
# A B
# Start
# 2004-01-11 0 0
# 2004-01-25 0 0
# 2004-02-08 123 0
# [3 rows x 2 columns]