Merge two columns into one within the same data frame in pandas/python
Question
I want to merge two columns into one within the same DataFrame (start_end) and also remove the null values. I intend to merge 'Start station' and 'End station' into a single 'station' column, and keep 'Duration' aligned with the new 'station' column. I have tried pd.merge, pd.concat, and pd.append, but I cannot work it out.
The start_end DataFrame:
    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                                   NaN
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
Expected output:
    Duration                              stations
14      1407                        14th & V St NW
19       509                        21st & I St NW
20       638                        15th & P St NW
27      1532  Massachusetts Ave & Dupont Circle NW
28       759           Adams Mill & Columbia Rd NW
My code so far:
#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])
This is what I tried:
station = pd.merge([start_end['Start station'],start_end['End station']])
Recommended answer
>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
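To follow along, the frame shown above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the question, the 0..4 index follows the answer's df rather than the asker's 14..28 labels, and the integer dtype for Duration is an assumption.

```python
import numpy as np
import pandas as pd

# Values copied from the question; the default 0..4 index matches the answer's df.
df = pd.DataFrame({
    "Duration": [1407, 509, 638, 1532, 759],  # assumed to be integers
    "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
    "Start station": [
        "14th & V St NW",
        "21st & I St NW",
        np.nan,
        "Massachusetts Ave & Dupont Circle NW",
        "Adams Mill & Columbia Rd NW",
    ],
})
print(df)
```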
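Give both columns the same name (regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default).
>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)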
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
Stack, then unstack.
>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
   Duration                               station
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
>>>
This is how I think it works:
.stack creates a Series with a MultiIndex and takes care of the null values for you. It aligns the second level on the column names, and because the two station columns share a name there is only one such label per row, so unstacking produces a single column.
That's really just a guess, based on how the Index differs when you don't change the column names.
>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])
>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])
Seems a bit tricky; maybe someone will comment on it.
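The same idea (reshape to long form, drop the nulls) can also be sketched with melt, which does not depend on duplicate column names or on stack dropping nulls by default, something that may differ between pandas versions. This is a swapped-in technique, not the answer's own method; the sample frame is rebuilt from the question's values, and ignore_index=False assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's values (0..4 index as in the answer).
df = pd.DataFrame({
    "Duration": [1407, 509, 638, 1532, 759],
    "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
    "Start station": [
        "14th & V St NW",
        "21st & I St NW",
        np.nan,
        "Massachusetts Ave & Dupont Circle NW",
        "Adams Mill & Columbia Rd NW",
    ],
})

# Long form: one row per (original row, station column) pair.
long_form = df.melt(
    id_vars="Duration",
    value_vars=["End station", "Start station"],
    value_name="stations",
    ignore_index=False,  # keep the original row labels
)

# Drop null stations, discard the helper column that names the source
# column, and restore the original row order.
result = (
    long_form.dropna(subset=["stations"])
    .drop(columns="variable")
    .sort_index()
)
print(result)
```

Unlike the stack/unstack route, Duration keeps its numeric dtype here because it is never mixed with strings in one Series.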
Alternative: use pd.concat and .dropna
>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object
>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
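Since each row has exactly one non-null station in the sample data, a shorter variant is to fill one column from the other with Series.combine_first. This is a sketch under that assumption, not part of the original answer; if a row had both stations filled, it would keep only 'Start station'.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's values (0..4 index as in the answer).
df = pd.DataFrame({
    "Duration": [1407, 509, 638, 1532, 759],
    "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
    "Start station": [
        "14th & V St NW",
        "21st & I St NW",
        np.nan,
        "Massachusetts Ave & Dupont Circle NW",
        "Adams Mill & Columbia Rd NW",
    ],
})

# Take 'Start station' where present, otherwise fall back to 'End station'.
stations = df["Start station"].combine_first(df["End station"]).rename("stations")

# Pair the merged column with Duration; both share the same row labels.
df2 = pd.concat([df["Duration"], stations], axis=1)
print(df2)
```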