重新格式化 pandas DataFrame [英] Reformat pandas DataFrame

查看:57
本文介绍了重新格式化 pandas DataFrame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个pandas.DataFrame,其中包含以下数据:

I have a pandas.DataFrame with the following data:

 country    branch      Name           salary    mobile no     emailid
    x        a           aa            250000     Null            Null
    x        b           bb            350000    8976646410      xx@xx.com
    y        c           cc            450000    8777945411      yy@yy.com
    y        d           dd            589630     Null            Null

根据某些条件,我过滤了DataFrame(伪代码):

Depending on certain criteria, I filter the DataFrame (pseudocode):

if salary <= 250000: 
    Normal Employee
elif salary >= 250000 and salary <= 600000:
    Experienced Employee

为此,我添加了一个新列,如下所示:

In doing this, I add a new column as follows:

normal = data_df['salary'] <= 250000
experienced = (data_df['salary'] > 250000) &   \
    (data_df['customer_total_sales'] <= 600000)
data_df['position'] = np.where(normal, 'normal', 
    np.where(experienced, 'experienced','unknown'))

但是,我想显示DataFrame如下,删除值为Null的行:

Yet, I would like to display the DataFrame as follows, removing rows with the value Null:

country    branch   count_employee   count_mobile_no    count_email_id   count_normal _employee       count_experienced_employee
  x           a        1                0                     0             1                                0
  y           c        1                1                     1            0                                   1

要计数字段,我使用以下代码:

To count fields, I use the following code:

a = {'employee': ['count'],
     'mobile_number': ['count'],
     'customer_emailid': ['count']}

推荐答案

您可以 replace NullNaN,然后 reset_index :

You can replace Null to NaN and then groupby with agg and last reset_index:

print data_df
  country branch Name  salary   mobile no    emailid position
0       x      a   aa  250000        Null       Null  unknown
1       x      b   bb  350000  8976646410  xx@xx.com  unknown
2       y      c   cc  450000  8777945411  yy@yy.com  unknown
3       y      d   dd  589630        Null       Null  unknown

data_df = data_df.replace('Null', np.nan)
print data_df
  country branch Name  salary   mobile no    emailid position
0       x      a   aa  250000         NaN        NaN  unknown
1       x      b   bb  350000  8976646410  xx@xx.com  unknown
2       y      c   cc  450000  8777945411  yy@yy.com  unknown
3       y      d   dd  589630         NaN        NaN  unknown

df = data_df.groupby(['country', 'branch']).agg({'Name': 'count',
                                                 'mobile no':'count', 
                                                 'emailid': 'count',
                                                 'position': 'count'})

print df.reset_index()
  country branch  emailid  position  Name  mobile no
0       x      a        0         1     1          0
1       x      b        1         1     1          1
2       y      c        1         1     1          1
3       y      d        0         1     1          0

如果需要按category计算位置,请为每个类别创建columns,然后

If you need count positions by category, create columns for each category, then groupby with count, drop column salary and last reset_index:

print data_df
  country branch Name  salary   mobile no    emailid
0       x      a   aa  250000        Null       Null
1       x      a   aa   20000        Null       Null
2       x      b   bb  350000  8976646410  xx@xx.com
3       y      c   cc   45000  8777945411  yy@yy.com
4       y      d   dd  589630        Null       Null

normal = data_df['salary'] <= 20000
experienced = (data_df['salary'] > 20000) & (data_df['salary'] <= 50000)
unknown = data_df['salary'] > 50000

data_df.loc[normal, 'position_normal'] = 'normal employee'
data_df.loc[experienced,'position_experienced'] = 'experienced employee'
data_df.loc[unknown,'position_unknown'] = 'unknown employee'
print data_df
  country branch Name  salary   mobile no    emailid  position_normal  \
0       x      a   aa  250000        Null       Null              NaN   
1       x      a   aa   20000        Null       Null  normal employee   
2       x      b   bb  350000  8976646410  xx@xx.com              NaN   
3       y      c   cc   45000  8777945411  yy@yy.com              NaN   
4       y      d   dd  589630        Null       Null              NaN   

   position_experienced  position_unknown  
0                   NaN  unknown employee  
1                   NaN               NaN  
2                   NaN  unknown employee  
3  experienced employee               NaN  
4                   NaN  unknown employee 

#replace Null to NaN
data_df = data_df.replace('Null', np.nan)
df = data_df.groupby(['country', 'branch']).count()
#remove column salary
df = df.drop('salary', axis=1)

df = df.reset_index()
print df
  country branch  Name  mobile no  emailid  position_normal  \
0       x      a     2          0        0                1   
1       x      b     1          1        1                0   
2       y      c     1          1        1                0   
3       y      d     1          0        0                0   

   position_experienced  position_unknown  
0                     0                 1  
1                     0                 1  
2                     1                 0  
3                     0                 1  

这篇关于重新格式化 pandas DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆