Merge two columns into one within the same data frame in pandas/python

Problem description

I want to merge two columns into one within the same DataFrame (start_end) and remove the null values. I intend to merge 'Start station' and 'End station' into a single 'station' column, keeping 'Duration' aligned with the new column. I have tried pd.merge, pd.concat and pd.append, but I cannot work it out.

The start_end DataFrame:

    Duration    End station        Start station
14  1407        NaN                14th & V St NW
19  509         NaN                21st & I St NW
20  638         15th & P St NW.    NaN
27  1532        NaN                Massachusetts Ave & Dupont Circle NW
28  759         NaN                Adams Mill & Columbia Rd NW

Expected output:

    Duration    stations
14  1407        14th & V St NW
19  509         21st & I St NW
20  638         15th & P St NW
27  1532        Massachusetts Ave & Dupont Circle NW
28  759         Adams Mill & Columbia Rd NW

Code I have so far:

#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])

This is what I attempted:

station = pd.merge([start_end['Start station'],start_end['End station']])

Solution

>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Give the two columns the same name

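>>> # regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)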
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Stack then unstack.

>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
  Duration                               station
0     1407                        14th & V St NW
1      509                        21st & I St NW
2      638                       15th & P St NW.
3     1532  Massachusetts Ave & Dupont Circle NW
4      759           Adams Mill & Columbia Rd NW

This is how I think it works:

.stack creates a Series with a MultiIndex and drops the null values for you. The second level of that index holds the column names; because the two station columns now share a name, that level contains only one 'station' label, so unstacking produces a single column.

That is really just a guess, based on how the index differs when the column names are left unchanged.

>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])

>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

Seems a bit tricky, maybe someone will comment on it.
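
The stacking behaviour guessed at above can be checked without relying on duplicate column names. The following is a minimal sketch, not the original answer's code: it rebuilds the sample frame by hand (values copied from the question), stacks only the two station columns, and then drops the inner index level. The explicit .dropna() is an assumption added here, because whether stack drops nulls on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the index uses the question's row labels.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

# Stacking moves the column names into a second index level.
# dropna() gives the same result whether or not stack already dropped nulls.
stacked = df[["End station", "Start station"]].stack().dropna()

# Each row has exactly one non-null station, so dropping the column-name
# level leaves one value per original row label.
stations = stacked.droplevel(1).rename("stations")

# Attach the merged column back to Duration, aligned on the row labels.
result = df[["Duration"]].join(stations)
print(result)
```

Because the station columns keep their own names here, the intermediate `stacked` Series also shows which column each surviving value came from.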


Alternative - Using pd.concat and .dropna

>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object

>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
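
For reference, the concat and dropna alternative can be run end to end as below. This is a sketch under assumptions: the frame is rebuilt by hand from the question's data, and the columns are selected by name rather than by position. The last step also compares against Series.combine_first, which is not part of the original answer but expresses the same "take whichever value is present" idea in a single call.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the index uses the question's row labels.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

# Approach from the answer: put both columns end to end, then drop the nulls.
stations = pd.concat([df["End station"], df["Start station"]]).dropna()
stations.name = "stations"

# Concatenating along axis=1 aligns on the index, so each Duration
# is matched with the station that carries the same row label.
df2 = pd.concat([df["Duration"], stations], axis=1)

# Not from the original answer: combine_first fills the nulls of one
# Series with the values of another that shares the same index.
merged = df["Start station"].combine_first(df["End station"])
df3 = df[["Duration"]].assign(stations=merged)

print(df2)
print(df3)
```

Both results hold the same station for each row label; combine_first additionally keeps the original row order without depending on index alignment in a second concat.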
