Stacked Bar Plot By Group Count On Pandas Python
Question
My CSV data looks something like the sample below. I want to create a stacked bar plot with pandas/Python where each bar shows the male and female portions in two colors, with the total count of males and females taking the drug displayed on top of the bar. For instance, 7 people are aged 20, 6 of them male and 1 female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age, count them, and plot the result, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd

# read_csv already returns a DataFrame
df = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')

# Number of patients per age and per gender
temp1 = df.groupby('Age').Age.count()
temp2 = df.groupby('Gender').Age.count()

# One bar per age, annotated with its count
ax = temp1.plot(kind='bar', stacked=False, colormap='Paired')
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
The result looks like this:
Recommended answer
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so some things could probably be optimized.
I started by getting a list of ages to use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
import io
import pandas as pd
# io.StringIO wraps the CSV text in a file-like object for read_csv
df = pd.read_csv(io.StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
            Drug_ID
Age Gender
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2
Using that, I can easily calculate the total number of individuals per age group:
# Sum over the Gender level, leaving one row per Age
totals = counts.groupby(level=0).sum()
print(totals)
     Drug_ID
Age
15         1
17         1
19         2
20         7
21         4
23         7
24         5
To prepare for plotting, I'll transform my counts dataframe so that each sex gets its own column instead of being an index level. Here I also drop the 'Drug_ID' level from the column names, because the unstack() operation creates a MultiIndex on the columns and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender    F    M
Age
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
Looks pretty good. I'll just do a final refinement and replace the NaN values with 0.
counts = counts.fillna(0)
print(counts)
Gender    F    M
Age
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
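As an aside, the grouping, unstacking and NaN-filling steps above can probably be collapsed into one call. The following is only a minimal sketch, assuming pandas' crosstab() is acceptable here; the rows list is a hypothetical compact stand-in for the CSV, rebuilt from the counts shown above, and it is not part of the original answer.

```python
import pandas as pd

# Hypothetical compact stand-in for the CSV: (Age, Gender, number of rows)
rows = [(15, 'F', 1), (17, 'M', 1), (19, 'M', 2), (20, 'F', 1), (20, 'M', 6),
        (21, 'F', 1), (21, 'M', 3), (23, 'F', 3), (23, 'M', 4), (24, 'F', 3),
        (24, 'M', 2)]
df = pd.DataFrame([(age, g) for age, g, n in rows for _ in range(n)],
                  columns=['Age', 'Gender'])

# crosstab counts the rows for every (Age, Gender) pair;
# pairs that never occur come out as 0 rather than NaN
counts = pd.crosstab(df['Age'], df['Gender'])

# Row sums give the total number of individuals per age
totals = counts.sum(axis=1)

print(counts)
print(totals)
```

Either way, counts ends up with one row per age and one column per gender; the crosstab version simply keeps the counts as integers.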
With this dataframe, it is trivial to plot the stacked bars:
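# align='edge' draws each bar from x to x+width; matplotlib 2.0+ centers bars on x by default
plt.bar(ages, counts['M'], align='edge', bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], align='edge', bottom=counts['M'], color='pink', label='F')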
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
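For comparison, pandas can draw the stacked bars itself through DataFrame.plot(). This is a hedged sketch rather than the method used in this answer: the rows list is a hypothetical stand-in for the CSV, the Agg backend is only an assumption so that the code runs without a display, and note that pandas places the bars at positions 0, 1, 2, ... with the ages as tick labels instead of using the ages as x coordinates.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical compact stand-in for the CSV: (Age, Gender, number of rows)
rows = [(15, 'F', 1), (17, 'M', 1), (19, 'M', 2), (20, 'F', 1), (20, 'M', 6),
        (21, 'F', 1), (21, 'M', 3), (23, 'F', 3), (23, 'M', 4), (24, 'F', 3),
        (24, 'M', 2)]
df = pd.DataFrame([(age, g) for age, g, n in rows for _ in range(n)],
                  columns=['Age', 'Gender'])
counts = pd.crosstab(df['Age'], df['Gender'])

# Column order controls stacking order: M at the bottom, F on top
ax = counts[['M', 'F']].plot(kind='bar', stacked=True, color=['blue', 'pink'])
ax.set_xlabel('Ages')
ax.set_ylabel('Count')
```

Calling plt.show() afterwards displays the figure in an interactive session.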
To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in one single pass; instead we'll loop through the ages and the totals. (totals is a one-column dataframe, so totals.values is a 2-D array; I flatten() it to get a plain 1-D sequence of counts.)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age+0.4, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age+0.4, tot) because each bar spans x to x+width, with width=0.8 by default, so x+0.4 is the center of the bar, while tot is of course the full height of the bar. To lift the text slightly above the bar, I offset it by a few (5) points in the y direction. Adjust to your liking.
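If the bars are centered on x instead, which is the default in matplotlib 2.0 and later, the +0.4 shift is not needed and the annotation can sit at xy=(age, tot). The sketch below is one possible variant under that assumption. It also computes the totals as a Series with groupby().size(), so no flatten() is required. The rows list and the labels list are hypothetical names introduced only for this illustration.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical compact stand-in for the CSV: (Age, Gender, number of rows)
rows = [(15, 'F', 1), (17, 'M', 1), (19, 'M', 2), (20, 'F', 1), (20, 'M', 6),
        (21, 'F', 1), (21, 'M', 3), (23, 'F', 3), (23, 'M', 4), (24, 'F', 3),
        (24, 'M', 2)]
df = pd.DataFrame([(age, g) for age, g, n in rows for _ in range(n)],
                  columns=['Age', 'Gender'])

counts = pd.crosstab(df['Age'], df['Gender'])
totals = df.groupby('Age').size()  # a Series: one total per age

fig, ax = plt.subplots()
# Default alignment: each bar is centered on its age
ax.bar(counts.index, counts['M'], color='blue', label='M')
ax.bar(counts.index, counts['F'], bottom=counts['M'], color='pink', label='F')
ax.legend()

# One label per bar, placed 5 points above the top of the stack
labels = []
for age, tot in totals.items():
    labels.append(ax.annotate('N={:d}'.format(int(tot)), xy=(age, tot),
                              xytext=(0, 5), textcoords='offset points',
                              ha='center', va='bottom'))
```

Iterating over totals.items() keeps each age paired with its own total, so the labels cannot drift out of step with the bars.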
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations.