Pandas: .groupby().size() and percentages
Question
I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:
Localization                           RNA level
cytoplasm                              1 Non-expressed   7
                                       2 Very low       13
                                       3 Low             8
                                       4 Medium          6
                                       5 Moderate        8
                                       6 High            2
                                       7 Very high       6
cytoplasm & nucleus                    1 Non-expressed   5
                                       2 Very low        8
                                       3 Low             2
                                       4 Medium         10
                                       5 Moderate       16
                                       6 High            6
                                       7 Very high       5
cytoplasm & nucleus & plasma membrane  1 Non-expressed   6
                                       2 Very low        3
                                       3 Low             3
                                       4 Medium          7
                                       5 Moderate        8
                                       6 High            4
                                       7 Very high       1
What I want to do is to calculate the separate occurrences (i.e. the last column, coming from .size()) as a percentage of the total number of occurrences in the applicable Localization.
For example: there are a total of 50 occurrences in the cytoplasm localisation (7 + 13 + 8 + 6 + 8 + 2 + 6), yielding 14% and 26% for the Non-expressed and Very low RNA levels, respectively.
Is there a nice way of doing this? I've been going about it in what I think is a very roundabout way, i.e. making a new DataFrame for every Localization and working on from there, but that takes a lot of lines and leaves the problem of having to merge all the resulting DataFrames in the end. I'm hoping there's a smarter way of doing it, at least!
Recommended answer
Here is a complete example based on the pandas groupby and sum functions.
The basic idea is to group the data by 'Localization' and to apply a function to each group.
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
data = \
"""Localization,RNA level,Size
cytoplasm ,1 Non-expressed, 7
cytoplasm ,2 Very low ,13
cytoplasm ,3 Low , 8
cytoplasm ,4 Medium , 6
cytoplasm ,5 Moderate , 8
cytoplasm ,6 High , 2
cytoplasm ,7 Very high , 6
cytoplasm & nucleus ,1 Non-expressed, 5
cytoplasm & nucleus ,2 Very low , 8
cytoplasm & nucleus ,3 Low , 2
cytoplasm & nucleus ,4 Medium ,10
cytoplasm & nucleus ,5 Moderate ,16
cytoplasm & nucleus ,6 High , 6
cytoplasm & nucleus ,7 Very high , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low , 3
cytoplasm & nucleus & plasma membrane,3 Low , 3
cytoplasm & nucleus & plasma membrane,4 Medium , 7
cytoplasm & nucleus & plasma membrane,5 Moderate , 8
cytoplasm & nucleus & plasma membrane,6 High , 4
cytoplasm & nucleus & plasma membrane,7 Very high , 1"""
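# Create the dataframe
df = pd.read_csv(StringIO(data))
# str.strip() and astype() return new objects, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each Size as a percentage of the total for its Localization
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())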
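The example above rebuilds the counts from CSV text, whereas the question starts from the result of .groupby().size() itself, which is a Series with a two-level index. The following is a minimal sketch of the same idea applied directly to such a Series. The small `raw` frame and the "nucleus" group are hypothetical stand-ins for the asker's data, which is not shown in the question.

```python
import pandas as pd

# Hypothetical miniature of the raw data: one row per observation.
raw = pd.DataFrame({
    "Localization": ["cytoplasm"] * 4 + ["nucleus"] * 2,
    "RNA level": ["1 Non-expressed", "2 Very low", "2 Very low", "2 Very low",
                  "1 Non-expressed", "3 Low"],
})

# Same kind of object as in the question: counts indexed by
# (Localization, RNA level).
sizes = raw.groupby(["Localization", "RNA level"]).size()

# Total per Localization, broadcast back onto every row of that group.
totals = sizes.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total.
percent = 100 * sizes / totals
print(percent)
```

Because transform returns a result aligned to the original index, no per-group DataFrames need to be created or merged afterwards.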