Transform wide to long but with repetition of a specific column
Problem description
I have a dataframe as shown below
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'pid': [1, 2, 3, 4],
    'BP1Date': ['12/11/2016', '12/21/2016', '12/31/2026', np.nan],
    'BP1di': [21, 24, 25, np.nan],
    'BP1sy': [123, 125, 127, np.nan],
    'BP2Date': ['12/31/2016', '12/31/2016', '12/31/2016', '12/31/2016'],
    'BP2di': [21, 26, 28, 30],
    'BP2sy': [123, 130, 135, 145],
    'BP3Date': ['12/31/2017', '12/31/2018', '12/31/2019', '12/31/2116'],
    'BP3di': [21, 31, 36, np.nan],
    'BP3sy': [123, 126, 145, np.nan],
})
I expect my output to look like the one shown below
This is what I tried, based on suggestions from other SO posts, but I was unable to produce or get close to the expected output:
df = pd.melt(df2, id_vars='pid', var_name='col', value_name='dates')
df['col2'] = [x.split("Date")[0][:3] for x in df['col']]
df = df[df.groupby(['pid','col2'])['dates'].transform('count').ne(0)].copy()
df['col3'] = df['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df.sort_values(by=['pid','col3'])
Please note two things:
a) For each date, I have two readings (BP{n}di, BP{n}sy)
b) I would like to drop NA records only when all 3 columns are NA together (in this case, for pid = 4, BP1Date, BP1di and BP1sy are all NA). If any of the columns is not NA, the NA values should be retained, as shown below. Hence I didn't use stack(dropna=False); instead I am using pd.melt, based on SO posts
How can I transform the input to achieve the output shown in the screenshot above?
Updated the screenshot based on the comments on the answer
Recommended answer
Use lreshape with DataFrame.stack to reshape, then remove the rows with a missing Date column using DataFrame.dropna, and sort by the first 3 columns:
# Collect the wide columns for each field, in BP1, BP2, BP3 order
a = [col for col in df2.columns if col.endswith('Date')]
b = [col for col in df2.columns if col.endswith('di')]
c = [col for col in df2.columns if col.endswith('sy')]
df1 = (pd.lreshape(df2, {'Date':a, 'di':b, 'sy':c}, dropna=False)
         .set_index(['pid','Date'])
         .stack(dropna=False)
         .rename_axis(['pid','Date','type'])
         .reset_index(name='value')
         .dropna(subset=['Date'])
         # the input dates are month/day/year, so parse them with an explicit format
         .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
         .sort_values(['pid','Date','type'])
         .reset_index(drop=True)
      )
print (df1)
pid Date type value
0 1 2016-12-11 di 21.0
1 1 2016-12-11 sy 123.0
2 1 2016-12-31 di 21.0
3 1 2016-12-31 sy 123.0
4 1 2017-12-31 di 21.0
5 1 2017-12-31 sy 123.0
6 2 2016-12-21 di 24.0
7 2 2016-12-21 sy 125.0
8 2 2016-12-31 di 26.0
9 2 2016-12-31 sy 130.0
10 2 2018-12-31 di 31.0
11 2 2018-12-31 sy 126.0
12 3 2016-12-31 di 28.0
13 3 2016-12-31 sy 135.0
14 3 2019-12-31 di 36.0
15 3 2019-12-31 sy 145.0
16 3 2026-12-31 di 25.0
17 3 2026-12-31 sy 127.0
18 4 2016-12-31 di 30.0
19 4 2016-12-31 sy 145.0
20 4 2116-12-31 di NaN
21 4 2116-12-31 sy NaN
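The question's requirement (b), dropping a record only when Date, di and sy are all missing, can also be spelled out explicitly. The following is a minimal sketch of one possible reading of that rule, not the answer's own method: it uses a small made-up frame (`wide` and its values are illustrative), keeps pd.lreshape for the first step, and swaps in DataFrame.melt for the final stack. It assumes the dates are month/day/year strings.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same column layout as df2 (values are illustrative).
wide = pd.DataFrame({
    'pid': [1, 2],
    'BP1Date': ['12/11/2016', np.nan],
    'BP1di': [21, np.nan],
    'BP1sy': [123, np.nan],
    'BP2Date': ['12/31/2016', '12/31/2016'],
    'BP2di': [21, np.nan],
    'BP2sy': [123, 130],
})

groups = {
    'Date': ['BP1Date', 'BP2Date'],
    'di': ['BP1di', 'BP2di'],
    'sy': ['BP1sy', 'BP2sy'],
}

# One row per (pid, measurement block); dropna=False keeps every block for now.
blocks = pd.lreshape(wide, groups, dropna=False)

# Drop a block only when Date, di and sy are all missing (pid 2, BP1 here).
blocks = blocks.dropna(subset=['Date', 'di', 'sy'], how='all')

# Turn the di/sy columns into (type, value) pairs, repeating pid and Date.
long_df = (blocks.melt(id_vars=['pid', 'Date'], var_name='type', value_name='value')
                 .assign(Date=lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
                 .sort_values(['pid', 'Date', 'type'])
                 .reset_index(drop=True))
print(long_df)
```

In this sketch the BP2 block of pid 2 is kept even though its di reading is missing, because its Date and sy are present.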
An alternative solution uses a MultiIndex in the columns, created by Series.str.extract and MultiIndex.from_tuples:
df2 = df2.set_index('pid')
# Split each column name into (prefix, suffix), e.g. 'BP1di' -> ('BP1', 'di')
c = df2.columns.to_frame(name='orig')
c = c['orig'].str.extract('(.+)(Date|di|sy)').apply(tuple, axis=1)
df2.columns = pd.MultiIndex.from_tuples(c)
df1 = (df2.stack(0)
          .set_index(['Date'], append=True)
          .reset_index(level=1, drop=True)
          .stack(dropna=False)
          .rename_axis(['pid','Date','type'])
          .reset_index(name='value')
          .dropna(subset=['Date'])
          # the input dates are month/day/year, so parse them with an explicit format
          .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
          .sort_values(['pid','Date','type'])
          .reset_index(drop=True)
      )
print (df1)
pid Date type value
0 1 2016-12-11 di 21.0
1 1 2016-12-11 sy 123.0
2 1 2016-12-31 di 21.0
3 1 2016-12-31 sy 123.0
4 1 2017-12-31 di 21.0
5 1 2017-12-31 sy 123.0
6 2 2016-12-21 di 24.0
7 2 2016-12-21 sy 125.0
8 2 2016-12-31 di 26.0
9 2 2016-12-31 sy 130.0
10 2 2018-12-31 di 31.0
11 2 2018-12-31 sy 126.0
12 3 2016-12-31 di 28.0
13 3 2016-12-31 sy 135.0
14 3 2019-12-31 di 36.0
15 3 2019-12-31 sy 145.0
16 3 2026-12-31 di 25.0
17 3 2026-12-31 sy 127.0
18 4 2016-12-31 di 30.0
19 4 2016-12-31 sy 145.0
20 4 2116-12-31 di NaN
21 4 2116-12-31 sy NaN
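To make the column-splitting step of the alternative solution easier to follow, here is a minimal sketch of it in isolation. The column names and the one-row `frame` are hypothetical, chosen only to follow the BP{n}Date / BP{n}di / BP{n}sy pattern from the question.

```python
import pandas as pd

# Hypothetical column names following the BP{n}Date / BP{n}di / BP{n}sy pattern.
cols = pd.Series(['BP1Date', 'BP1di', 'BP1sy', 'BP2Date', 'BP2di', 'BP2sy'])

# The greedy '(.+)' backtracks until one of the suffixes can match,
# so 'BP1di' is split into 'BP1' and 'di'.
parts = cols.str.extract(r'(.+)(Date|di|sy)')
tuples = list(parts.itertuples(index=False, name=None))
columns = pd.MultiIndex.from_tuples(tuples)

# One illustrative row of data under the two-level columns.
frame = pd.DataFrame([['12/11/2016', 21, 123, '12/31/2016', 21, 123]],
                     columns=columns)

# Selecting a first-level key returns one block with plain column names.
bp1 = frame['BP1']
print(columns.tolist())
print(bp1)
```

With the block prefix on the first level and the field name on the second, stacking the first level yields one row per pid and block, which is what the alternative solution relies on.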