Stacked Bar Plot By Group Count On Pandas Python


Question

My csv data looks something like the sample below. I want to create a stacked bar plot with pandas/python where each bar shows the male and female portions in two colors, and the total count of males and females taking the drug is shown on top of the bar. For instance, 7 people fall at age 20: 6 of them are male and 1 is female, so the bar should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age, count them, and plot the result, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.

Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')

# read_csv already returns a DataFrame, so no extra conversion or merge is needed
temp1 = data.groupby('Age').Age.count()     # number of patients per age
temp2 = data.groupby('Gender').Age.count()  # number of patients per gender

ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2.,   p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()

The result looks like this (figure not shown):

Recommended answer

This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so some things could probably be optimized.

I started by getting a list of ages that I will use for my x-axis:

cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''

from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()

array([15, 17, 19, 20, 21, 23, 24])

Then I generated a grouped dataframe with the counts of M and F per age:

counts = df.groupby(['Age','Gender']).count()
print(counts)

            Drug_ID
Age Gender         
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2

Using that, I can easily calculate the total number of individuals per age group:

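# sum(level=0) was removed in pandas 2.0; grouping on the first index level is equivalent
totals = counts.groupby(level=0).sum()
print(totals)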

     Drug_ID
Age         
15         1
17         1
19         2
20         7
21         4
23         7
24         5

To prepare for plotting, I'll transform my counts dataframe so that each gender gets its own column instead of being an index level. Here I also drop the 'Drug_ID' column level, because unstack() leaves the columns as a MultiIndex (Drug_ID, Gender), and the dataframe is much easier to manipulate without it.

counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)

Gender    F    M
Age             
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

Looks pretty good. I'll just do a final refinement and replace the NaN values with 0.

counts = counts.fillna(0)
print(counts)

Gender    F    M
Age             
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
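
As a possible shortcut (not part of the original answer), pandas' crosstab() can build the same Age-by-Gender table in one call, with zeros instead of NaN for missing combinations. The sketch below uses a shortened, made-up subset of the data; the `csv_text` string and its single-letter IDs are illustrative assumptions, not the real file.

```python
import io
import pandas as pd

# Hypothetical, shortened stand-in for the CSV shown in the question
csv_text = """Drug_ID,Age,Gender
a,15,F
b,17,M
c,20,F
d,20,M
e,20,M
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Rows are ages, columns are genders, cells are counts (0 where a pair never occurs)
crosstab_counts = pd.crosstab(sample["Age"], sample["Gender"])

# Per-age totals are the row sums
crosstab_totals = crosstab_counts.sum(axis=1)

print(crosstab_counts)
print(crosstab_totals)
```

Because crosstab() returns integer counts, no fillna() step is needed with this variant.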

With this dataframe, plotting the stacked bars is straightforward:

plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
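
For comparison, here is a minimal sketch of the same chart drawn through pandas' own plotting wrapper rather than two plt.bar() calls. It is an alternative, not the answer's method; the `demo_counts` table is rebuilt by hand from the output above so the block runs on its own, and the Agg backend is an assumption made only so that it works without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import pandas as pd

# Counts per age and gender, copied from the table printed above
demo_counts = pd.DataFrame(
    {"F": [1, 0, 0, 1, 1, 3, 3], "M": [0, 1, 2, 6, 3, 4, 2]},
    index=pd.Index([15, 17, 19, 20, 21, 23, 24], name="Age"),
)

# stacked=True puts each column on top of the previous one; M is listed first so it sits at the bottom
ax = demo_counts[["M", "F"]].plot(kind="bar", stacked=True, color=["blue", "pink"])
ax.set_xlabel("Ages")
ax.set_ylabel("Count")
```

Note that pandas bar plots place the bars at positions 0, 1, 2, ... and use the ages only as tick labels, so any annotation added afterwards has to use those positions rather than the age values.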

To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in a single call; instead we'll loop through the ages and the totals (for simplicity, I take the values and flatten() them, because totals is a one-column dataframe and its values come back as a 2-D array):

for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')

The coordinates for the annotations are (age, tot): since matplotlib 2.0, bars are centered on their x value by default (align='center'), so age is the horizontal center of the bar, while tot is the full height of the bar. (In older matplotlib versions, where bars ran from x to x+width with width=0.8 by default, the center was at x+0.4 instead.) To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.

Check out the documentation for bar() to adjust the parameters of the bar plot, and the documentation for annotate() to customize your annotations.
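
Putting the stacking and annotation steps together, the sketch below shows one way the whole figure might be assembled with matplotlib's object-oriented interface. The `demo_counts` table is typed in by hand from the counts shown earlier, and the Agg backend is an assumption so the block runs without a display; treat it as an illustration rather than the definitive version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line when plotting interactively
import matplotlib.pyplot as plt
import pandas as pd

# Counts per age and gender, copied from the table printed earlier
demo_counts = pd.DataFrame(
    {"F": [1, 0, 0, 1, 1, 3, 3], "M": [0, 1, 2, 6, 3, 4, 2]},
    index=pd.Index([15, 17, 19, 20, 21, 23, 24], name="Age"),
)
demo_totals = demo_counts.sum(axis=1)  # total individuals per age

fig, ax = plt.subplots()
# Male counts start at zero; female counts are stacked on top of them via `bottom`
ax.bar(demo_counts.index, demo_counts["M"], color="blue", label="M")
ax.bar(demo_counts.index, demo_counts["F"], bottom=demo_counts["M"], color="pink", label="F")

# One label per bar, anchored at the top center and shifted up by 5 points
for age, tot in demo_totals.items():
    ax.annotate("N={:d}".format(int(tot)), xy=(age, tot), xytext=(0, 5),
                textcoords="offset points", ha="center", va="bottom")

ax.set_xlabel("Ages")
ax.set_ylabel("Count")
ax.legend()
```

Calling fig.savefig() or plt.show() afterwards writes or displays the figure.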
