重新格式化 pandas DataFrame [英] Reformat pandas DataFrame
问题描述
我有一个pandas
.DataFrame
,其中包含以下数据:
I have a pandas
.DataFrame
with the following data:
country branch Name salary mobile no emailid
x a aa 250000 Null Null
x b bb 350000 8976646410 xx@xx.com
y c cc 450000 8777945411 yy@yy.com
y d dd 589630 Null Null
根据某些条件,我过滤了DataFrame
(伪代码):
Depending on certain criteria, I filter the DataFrame
(pseudocode):
if salary <= 250000:
Normal Employee
elif salary >= 250000 and salary <= 600000:
Experienced Employee
为此,我添加了一个新列,如下所示:
In doing this, I add a new column as follows:
normal = data_df['salary'] <= 250000
experienced = (data_df['salary'] > 250000) & \
(data_df['customer_total_sales'] <= 600000)
data_df['position'] = np.where(normal, 'normal',
np.where(experienced, 'experienced','unknown'))
但是,我想显示DataFrame
如下,删除值为Null
的行:
Yet, I would like to display the DataFrame
as follows, removing rows with the value Null
:
country branch count_employee count_mobile_no count_email_id count_normal _employee count_experienced_employee
x a 1 0 0 1 0
y c 1 1 1 0 1
要计数字段,我使用以下代码:
To count fields, I use the following code:
a = {'employee': ['count'],
'mobile_number': ['count'],
'customer_emailid': ['count']}
推荐答案
您可以 replace
Null
到NaN
,然后 agg
和最后一个 reset_index
:
You can replace
Null
to NaN
and then groupby
with agg
and last reset_index
:
print data_df
country branch Name salary mobile no emailid position
0 x a aa 250000 Null Null unknown
1 x b bb 350000 8976646410 xx@xx.com unknown
2 y c cc 450000 8777945411 yy@yy.com unknown
3 y d dd 589630 Null Null unknown
data_df = data_df.replace('Null', np.nan)
print data_df
country branch Name salary mobile no emailid position
0 x a aa 250000 NaN NaN unknown
1 x b bb 350000 8976646410 xx@xx.com unknown
2 y c cc 450000 8777945411 yy@yy.com unknown
3 y d dd 589630 NaN NaN unknown
df = data_df.groupby(['country', 'branch']).agg({'Name': 'count',
'mobile no':'count',
'emailid': 'count',
'position': 'count'})
print df.reset_index()
country branch emailid position Name mobile no
0 x a 0 1 1 0
1 x b 1 1 1 1
2 y c 1 1 1 1
3 y d 0 1 1 0
如果需要按category
计算位置,请为每个类别创建columns
,然后 count
, reset_index
:
If you need count positions by category
, create columns
for each category, then groupby
with count
, drop
column salary
and last reset_index
:
print data_df
country branch Name salary mobile no emailid
0 x a aa 250000 Null Null
1 x a aa 20000 Null Null
2 x b bb 350000 8976646410 xx@xx.com
3 y c cc 45000 8777945411 yy@yy.com
4 y d dd 589630 Null Null
normal = data_df['salary'] <= 20000
experienced = (data_df['salary'] > 20000) & (data_df['salary'] <= 50000)
unknown = data_df['salary'] > 50000
data_df.loc[normal, 'position_normal'] = 'normal employee'
data_df.loc[experienced,'position_experienced'] = 'experienced employee'
data_df.loc[unknown,'position_unknown'] = 'unknown employee'
print data_df
country branch Name salary mobile no emailid position_normal \
0 x a aa 250000 Null Null NaN
1 x a aa 20000 Null Null normal employee
2 x b bb 350000 8976646410 xx@xx.com NaN
3 y c cc 45000 8777945411 yy@yy.com NaN
4 y d dd 589630 Null Null NaN
position_experienced position_unknown
0 NaN unknown employee
1 NaN NaN
2 NaN unknown employee
3 experienced employee NaN
4 NaN unknown employee
#replace Null to NaN
data_df = data_df.replace('Null', np.nan)
df = data_df.groupby(['country', 'branch']).count()
#remove column salary
df = df.drop('salary', axis=1)
df = df.reset_index()
print df
country branch Name mobile no emailid position_normal \
0 x a 2 0 0 1
1 x b 1 1 1 0
2 y c 1 1 1 0
3 y d 1 0 0 0
position_experienced position_unknown
0 0 1
1 0 1
2 1 0
3 0 1
这篇关于重新格式化 pandas DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!