Creating new columns based on value from another column in pandas


Question

I have a pandas DataFrame with a column "Code" that contains sequential hierarchical codes. My goal is to create new columns holding the code and the name of each hierarchical level, as follows:

Original data:

    Code    Name
0   A       USA
1   AM      Massachusetts
2   AMB     Boston
3   AMS     Springfield
4   D       Germany
5   DB      Brandenburg
6   DBB     Berlin
7   DBD     Dresden

My goal:

  Code           Name Level1 Level1Name Level2     Level2Name Level3   Level3Name
0    A            USA      A        USA     AM  Massachusetts    AMB       Boston
1   AM  Massachusetts      A        USA     AM  Massachusetts    AMB       Boston
2  AMB         Boston      A        USA     AM  Massachusetts    AMB       Boston
3  AMS    Springfield      A        USA     AM  Massachusetts    AMS  Springfield
4    D        Germany      D    Germany     DB    Brandenburg    DBB       Berlin
5   DB    Brandenburg      D    Germany     DB    Brandenburg    DBB       Berlin
6  DBB         Berlin      D    Germany     DB    Brandenburg    DBB       Berlin
7  DBD        Dresden      D    Germany     DB    Brandenburg    DBD      Dresden

My code:

import pandas as pd
df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')
df['Length'] = df['Code'].str.len()  # length of each value in Code
df['Level1'] = df['Code'].str[:1]    # first level via string slicing
df['Level1Name'] = df[df['Length'] == 1]['Name']  # only set on rows where Length == 1
df.head()  # This yields:



    Code    Name          Length    Level1  Level1Name
0   A       USA             1         A     USA
1   AM      Massachusetts   2         A     NaN
2   AMB     Boston          3         A     NaN
3   AMS     Springfield     3         A     NaN
4   D       Germany         1         D     Germany
5   DB      Brandenburg     2         D     NaN
6   DBB     Berlin          3         D     NaN
7   DBD     Dresden         3         D     NaN

With my current approach, how do I turn those NaN values in the Level1Name column into USA and Germany respectively?

More generally, is there a better way to create a column for each hierarchical level and match it with the corresponding name from another column?

Recommended answer

IIUC, let's use this code:

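# Split each code into a list of its characters
df['Codes'] = [[*i] for i in df['Code']]
# One column per character position; bfill fills short codes from the rows below,
# cumsum(axis=1) joins the characters into prefixes
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').bfill().cumsum(axis=1)
# Lookup Series: Code -> Name
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
# Rename columns 0, 1, 2 to Level1, Level2, Level3
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
# Map every level code to its name; keys= sets the new column names
df_level_names = pd.concat([df_level[i].map(s_map) for i in df_level.columns],
                           axis=1,
                           keys=df_level.columns + 'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out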

Output:

  Code           Name Level1 Level2 Level3 Level1Name     Level2Name   Level3Name
0    A            USA      A     AM    AMB        USA  Massachusetts       Boston
1   AM  Massachusetts      A     AM    AMB        USA  Massachusetts       Boston
2  AMB         Boston      A     AM    AMB        USA  Massachusetts       Boston
3  AMS    Springfield      A     AM    AMS        USA  Massachusetts  Springfield
4    D        Germany      D     DB    DBB    Germany    Brandenburg       Berlin
5   DB    Brandenburg      D     DB    DBB    Germany    Brandenburg       Berlin
6  DBB         Berlin      D     DB    DBB    Germany    Brandenburg       Berlin
7  DBD        Dresden      D     DB    DBD    Germany    Brandenburg      Dresden
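
To see how the Level columns come about, here is a minimal, self-contained sketch that rebuilds them step by step on the sample data from the question. The names chars, levels, prefix and pos are illustrative and do not appear in the answer. The explicit loop is a stand-in for cumsum(axis=1), used here only so that each intermediate step is visible.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Code": ["A", "AM", "AMB", "AMS", "D", "DB", "DBB", "DBD"],
    "Name": ["USA", "Massachusetts", "Boston", "Springfield",
             "Germany", "Brandenburg", "Berlin", "Dresden"],
})

# One column per character position (0, 1, 2); short codes leave NaN
chars = df["Code"].str.extractall("(.)")[0].unstack("match")

# Fill each NaN from the next row below that has a value in that position
chars = chars.bfill()

# Join characters left to right into prefixes. This loop stands in for
# cumsum(axis=1); it is an illustrative substitute, not the answer's code.
levels = pd.DataFrame(index=chars.index)
prefix = None
for pos in chars.columns:
    prefix = chars[pos] if prefix is None else prefix + chars[pos]
    levels[f"Level{pos + 1}"] = prefix

print(levels)
```

Because bfill borrows characters from the rows below, this appears to rely on the rows being ordered so that each parent code is directly followed by its descendants, as in the sample data. A parent with no children would borrow from an unrelated row.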



Explanation:

  • Unpack each string into a list of characters to create the 'Codes' column.
  • Use extractall with the regex . to get one character per match, unstack the matches into columns, bfill the NaN values from the rows below, and cumsum along the rows (axis=1) to build the level codes.
  • Create s_map, a pd.Series for use with map: call explode on the 'Codes' column created above, drop_duplicates on 'Code' keeping the last row, then set_index on 'Code' and keep the 'Name' column.
  • Rename the df_level columns so they start at Level1 instead of 0.
  • Use pd.concat with a list comprehension to map each df_level column through s_map, giving df_level_names. The keys parameter names the new columns by appending 'Name'.
  • Use join to combine df with df_level and df_level_names, then drop the 'Codes' column to get the desired output.
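
For the narrower question of filling the NaN values in Level1Name, the same map idea can be applied directly to the asker's Level1 column. This is a minimal sketch, assuming every value in Code is unique; name_by_code is an illustrative name, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Code": ["A", "AM", "AMB", "AMS", "D", "DB", "DBB", "DBD"],
    "Name": ["USA", "Massachusetts", "Boston", "Springfield",
             "Germany", "Brandenburg", "Berlin", "Dresden"],
})

# Lookup Series: Code -> Name (assumes Code values are unique)
name_by_code = df.set_index("Code")["Name"]

# First-level code via string slicing, as in the question
df["Level1"] = df["Code"].str[:1]

# Look up the name of each first-level code instead of filtering by length
df["Level1Name"] = df["Level1"].map(name_by_code)

print(df)
```

With this lookup, every row gets the name of its first-level code, so no NaN remains as long as each first-level code also exists as a row in Code.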