Lengthening a DataFrame based on stacking columns within it in Pandas
Question
I am looking for a function that achieves the following, which is best shown with an example. Consider:
    pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])
which looks like this:
       x  y1   y2
    0  1   2    3
    1  4   5  NaN
I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:
       x  y
    0  1  2
    1  1  3
    2  4  5
That is, one row for each combination of x and y1, or of x and y2. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.
Recommended answer
Here's one based on NumPy, since you were looking for performance:
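    import numpy as np
    import pandas as pd

    def gather_columns(df):
        # Boolean mask selecting the columns whose names start with 'y'
        col_mask = [i.startswith('y') for i in df.columns]
        ally_vals = df.iloc[:, col_mask].values
        # True where a y value is present (not NaN)
        y_valid_mask = ~np.isnan(ally_vals)
        # Valid y values per row, i.e. how many times to repeat each x
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        # Boolean indexing flattens row by row, matching the repeated x values
        y_vals = ally_vals[y_valid_mask]
        return pd.DataFrame({'x': x_vals, 'y': y_vals})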
Sample run:
    In [78]: df  # (added more cols for variety)
    Out[78]:
       x  y1   y2   y5   y7
    0  1   2  3.0  NaN  NaN
    1  4   5  NaN  6.0  7.0

    In [79]: gather_columns(df)
    Out[79]:
       x    y
    0  1  2.0
    1  1  3.0
    2  4  5.0
    3  4  6.0
    4  4  7.0
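To make the mechanics concrete, here is a small sketch that walks through the intermediate arrays for the sample frame above. It rebuilds the same five-column frame by hand (the construction is an assumption, since the answer only shows the printed frame) and follows the same steps as gather_columns.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the printed output above (assumed construction)
df = pd.DataFrame(
    [[1, 2, 3, np.nan, np.nan], [4, 5, np.nan, 6, 7]],
    columns=['x', 'y1', 'y2', 'y5', 'y7'],
)

# Select the y columns by name and view them as one 2-D float array
col_mask = [c.startswith('y') for c in df.columns]
ally_vals = df.iloc[:, col_mask].to_numpy()

# True where a y value is present
y_valid_mask = ~np.isnan(ally_vals)

# Count of valid y values in each row: row 0 has 2, row 1 has 3
reps = np.count_nonzero(y_valid_mask, axis=1)

# Repeat each x once per valid y in its row
x_vals = np.repeat(df['x'].to_numpy(), reps)

# Boolean indexing returns the valid entries in row-major order,
# so they line up with the repeated x values
y_vals = ally_vals[y_valid_mask]

print(reps)    # [2 3]
print(x_vals)  # [1 1 4 4 4]
print(y_vals)  # [2. 3. 5. 6. 7.]
```

The y values come out as floats because a NaN anywhere among the y columns forces the combined array to a float dtype.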
If the y columns always run from the second column through to the end, we can simply slice the DataFrame by position and get a further performance boost, like so:
    def gather_columns_v2(df):
        # Assumes column 0 is x and every remaining column is a y column
        ally_vals = df.iloc[:, 1:].values
        y_valid_mask = ~np.isnan(ally_vals)
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        y_vals = ally_vals[y_valid_mask]
        return pd.DataFrame({'x': x_vals, 'y': y_vals})
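For comparison, the same reshaping can be sketched with pandas alone using DataFrame.melt. This is not part of the answer above: gather_columns_melt is a hypothetical helper name, it assumes pandas 1.1 or later (for the ignore_index parameter), and it may well be slower than the NumPy versions, so it is worth timing on your own data.

```python
import numpy as np
import pandas as pd

def gather_columns_melt(df):
    # Hypothetical pandas-only alternative, not from the original answer
    y_cols = [c for c in df.columns if c.startswith('y')]
    # Keep the original row labels so the row order can be restored later
    long = df.melt(id_vars='x', value_vars=y_cols,
                   value_name='y', ignore_index=False)
    # Drop the combinations where the y value is missing
    long = long.dropna(subset=['y'])
    # melt stacks column by column; a stable sort on the original row
    # label regroups the rows by source row, keeping column order within each
    long = long.sort_index(kind='mergesort')
    return long[['x', 'y']].reset_index(drop=True)

df = pd.DataFrame(
    [[1, 2, 3, np.nan, np.nan], [4, 5, np.nan, 6, 7]],
    columns=['x', 'y1', 'y2', 'y5', 'y7'],
)
result = gather_columns_melt(df)
print(result)
```

On the sample frame this yields the same five rows, in the same order, as the NumPy functions.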