Transform wide to long but with repetition of a specific column

Views: 57 | Posted: 2020/10/17 0:32:22 | Tags: python, python-3.x, pandas, dataframe, pandas-groupby

Problem description

I have a dataframe as shown below:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'pid':[1,2,3,4],
                    'BP1Date':['12/11/2016','12/21/2016','12/31/2026',np.nan],'BP1di':[21,24,25,np.nan],'BP1sy':[123,125,127,np.nan],
                    'BP2Date':['12/31/2016','12/31/2016','12/31/2016','12/31/2016'],'BP2di':[21,26,28,30],'BP2sy':[123,130,135,145],
                    'BP3Date':['12/31/2017','12/31/2018','12/31/2019','12/31/2116'],'BP3di':[21,31,36,np.nan],'BP3sy':[123,126,145,np.nan]})

It looks like this: [screenshot of the input dataframe]

I expect my output to look like the following: [screenshot of the expected output]

This is what I tried, based on suggestions from other SO posts, but I am unable to produce or get close to the expected output:

df = pd.melt(df2, id_vars='pid', var_name='col', value_name='dates')
df['col2'] = [x.split("Date")[0][:3] for x in df['col']]
df = df[df.groupby(['pid','col2'])['dates'].transform('count').ne(0)].copy()
df['col3'] = df['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df.sort_values(by=['pid','col3'])

Please note two things:

a) For each date, I have two readings (BP{n}di, BP{n}sy)

b) I would like to drop NA records only when all 3 columns are NA together (in this case, for pid = 4, BP1Date, BP1di and BP1sy are all NA). If any of the columns is not NA, the NA values should be retained, as shown below. Hence I did not use stack(dropna=False); instead I am using pd.melt, based on SO posts.

How can I transform the input to achieve the output shown in the screenshot above?

Updated the screenshot based on comments on the answer.

Recommended answer

Use lreshape with DataFrame.stack to reshape, then remove rows with a missing Date using DataFrame.dropna and sort by the first 3 columns:

a = [col for col in df2.columns if col.endswith('Date')]
b = [col for col in df2.columns if col.endswith('di')]
c = [col for col in df2.columns if col.endswith('sy')]

df1 = (pd.lreshape(df2, {'Date':a, 'di':b, 'sy':c}, dropna=False)
       .set_index(['pid','Date'])
       .stack(dropna=False)
       .rename_axis(['pid','Date','type'])
       .reset_index(name='value')
       .dropna(subset=['Date'])
       # the sample dates are month/day/year, so parse them with an explicit format
       .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
       .sort_values(['pid','Date','type'])
       .reset_index(drop=True)
       )

print (df1)
    pid       Date type  value
0     1 2016-12-11   di   21.0
1     1 2016-12-11   sy  123.0
2     1 2016-12-31   di   21.0
3     1 2016-12-31   sy  123.0
4     1 2017-12-31   di   21.0
5     1 2017-12-31   sy  123.0
6     2 2016-12-21   di   24.0
7     2 2016-12-21   sy  125.0
8     2 2016-12-31   di   26.0
9     2 2016-12-31   sy  130.0
10    2 2018-12-31   di   31.0
11    2 2018-12-31   sy  126.0
12    3 2016-12-31   di   28.0
13    3 2016-12-31   sy  135.0
14    3 2019-12-31   di   36.0
15    3 2019-12-31   sy  145.0
16    3 2026-12-31   di   25.0
17    3 2026-12-31   sy  127.0
18    4 2016-12-31   di   30.0
19    4 2016-12-31   sy  145.0
20    4 2116-12-31   di    NaN
21    4 2116-12-31   sy    NaN
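
The solution above drops rows based on the Date column alone. If a visit should be dropped only when Date, di and sy are all missing together, as point b) of the question asks, one possible variant is to call dropna with how='all' on the three reshaped columns before going fully long. The sketch below is not part of the original answer; the small frame `wide` and its values are made up for illustration, and DataFrame.melt is swapped in for stack to produce the type/value columns.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same column layout as the question (values are made up).
wide = pd.DataFrame({
    'pid': [1, 2],
    'BP1Date': ['12/11/2016', np.nan],
    'BP1di': [21, np.nan],
    'BP1sy': [123, np.nan],
    'BP2Date': ['12/31/2016', '12/31/2017'],
    'BP2di': [22, np.nan],
    'BP2sy': [124, np.nan],
})

# Group the wide columns by their suffix, as in the answer.
groups = {
    'Date': [c for c in wide.columns if c.endswith('Date')],
    'di': [c for c in wide.columns if c.endswith('di')],
    'sy': [c for c in wide.columns if c.endswith('sy')],
}

long_df = (pd.lreshape(wide, groups, dropna=False)
           # drop a visit only when Date, di and sy are ALL missing
           .dropna(subset=['Date', 'di', 'sy'], how='all')
           # one row per (pid, Date, type) with the reading in 'value'
           .melt(id_vars=['pid', 'Date'], var_name='type', value_name='value')
           .sort_values(['pid', 'Date', 'type'])
           .reset_index(drop=True))

print(long_df)
```

Here the first visit of pid 2 is removed because all three fields are missing, while its second visit is kept with NaN readings because the date is present.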

An alternative solution uses a MultiIndex in the columns, created with Series.str.extract and MultiIndex.from_tuples:

df2 = df2.set_index('pid')

c = df2.columns.to_frame(name='orig')
c = c['orig'].str.extract('(.+)(Date|di|sy)').apply(tuple, 1)

df2.columns = pd.MultiIndex.from_tuples(c)

df1 = (df2.stack(0)
       .set_index(['Date'], append=True)
       .reset_index(level=1, drop=True)
       .stack(dropna=False)
       .rename_axis(['pid','Date','type'])
       .reset_index(name='value')
       .dropna(subset=['Date'])
       # the sample dates are month/day/year, so parse them with an explicit format
       .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
       .sort_values(['pid','Date','type'])
       .reset_index(drop=True)
       )

print (df1)
    pid       Date type  value
0     1 2016-12-11   di   21.0
1     1 2016-12-11   sy  123.0
2     1 2016-12-31   di   21.0
3     1 2016-12-31   sy  123.0
4     1 2017-12-31   di   21.0
5     1 2017-12-31   sy  123.0
6     2 2016-12-21   di   24.0
7     2 2016-12-21   sy  125.0
8     2 2016-12-31   di   26.0
9     2 2016-12-31   sy  130.0
10    2 2018-12-31   di   31.0
11    2 2018-12-31   sy  126.0
12    3 2016-12-31   di   28.0
13    3 2016-12-31   sy  135.0
14    3 2019-12-31   di   36.0
15    3 2019-12-31   sy  145.0
16    3 2026-12-31   di   25.0
17    3 2026-12-31   sy  127.0
18    4 2016-12-31   di   30.0
19    4 2016-12-31   sy  145.0
20    4 2116-12-31   di    NaN
21    4 2116-12-31   sy    NaN
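
For comparison, the intermediate one-row-per-visit table can also be built with pd.wide_to_long. This is a different technique from the two solutions above and is offered only as a rough sketch: wide_to_long expects the stub name to come first, so the columns are renamed from the BP{n}Date pattern to Date_{n} beforehand. The frame `wide`, the helper column name `visit` and the renaming pattern are assumptions made for this illustration.

```python
import re

import numpy as np
import pandas as pd

# Illustrative frame with the same column layout as the question (values are made up).
wide = pd.DataFrame({
    'pid': [1, 2],
    'BP1Date': ['12/11/2016', np.nan],
    'BP1di': [21, np.nan],
    'BP1sy': [123, np.nan],
    'BP2Date': ['12/31/2016', '12/31/2017'],
    'BP2di': [22, np.nan],
    'BP2sy': [124, np.nan],
})

# Rename 'BP1Date' -> 'Date_1', 'BP1di' -> 'di_1', ... ('pid' does not match and is kept).
renamed = wide.rename(
    columns=lambda c: re.sub(r'^BP(\d+)(Date|di|sy)$', r'\2_\1', c))

tall = (pd.wide_to_long(renamed, stubnames=['Date', 'di', 'sy'],
                        i='pid', j='visit', sep='_')
        # the only columns left are Date, di and sy, so how='all'
        # removes a visit only when all three are missing
        .dropna(how='all')
        .reset_index())

print(tall)
```

The result has one row per pid and visit with Date, di and sy side by side; it can then be melted or stacked into the type/value layout in the same way as in the solutions above.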
