Pandas: .groupby().size() and percentages


Question

I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

Localization                           RNA level      
cytoplasm                              1 Non-expressed     7
                                       2 Very low         13
                                       3 Low               8
                                       4 Medium            6
                                       5 Moderate          8
                                       6 High              2
                                       7 Very high         6
cytoplasm & nucleus                    1 Non-expressed     5
                                       2 Very low          8
                                       3 Low               2
                                       4 Medium           10
                                       5 Moderate         16
                                       6 High              6
                                       7 Very high         5
cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                       2 Very low          3
                                       3 Low               3
                                       4 Medium            7
                                       5 Moderate          8
                                       6 High              4
                                       7 Very high         1

What I want to do is to calculate the separate occurrences (i.e. the last column coming from .size()) as a percentage of the total number of occurrences in the applicable Localization.

For example: there are a total of 50 occurrences in the cytoplasm localisation (7 + 13 + 8 + 6 + 8 + 2 + 6), yielding 14% (7/50) and 26% (13/50) for the Non-expressed and Very low RNA levels, respectively.

Is there a nice way of doing this? I've been going about it in what I think is a very roundabout way, i.e. making a new DataFrame for every Localization and working from there, but that takes a lot of lines and leaves the problem of having to merge all the resulting DataFrames at the end. I'm hoping there's a smarter way of doing it, at least!

Recommended answer

Here is a complete example based on pandas groupby and a per-group sum. The basic idea is to group the data by 'Localization' and apply a function to each group.

import pandas as pd
from io import StringIO
# For Python 2: from StringIO import StringIO

data = \
"""Localization,RNA level,Size
cytoplasm                            ,1 Non-expressed, 7
cytoplasm                            ,2 Very low     ,13
cytoplasm                            ,3 Low          , 8
cytoplasm                            ,4 Medium       , 6
cytoplasm                            ,5 Moderate     , 8
cytoplasm                            ,6 High         , 2
cytoplasm                            ,7 Very high    , 6
cytoplasm & nucleus                  ,1 Non-expressed, 5
cytoplasm & nucleus                  ,2 Very low     , 8
cytoplasm & nucleus                  ,3 Low          , 2
cytoplasm & nucleus                  ,4 Medium       ,10
cytoplasm & nucleus                  ,5 Moderate     ,16
cytoplasm & nucleus                  ,6 High         , 6
cytoplasm & nucleus                  ,7 Very high    , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low     , 3
cytoplasm & nucleus & plasma membrane,3 Low          , 3
cytoplasm & nucleus & plasma membrane,4 Medium       , 7
cytoplasm & nucleus & plasma membrane,5 Moderate     , 8
cytoplasm & nucleus & plasma membrane,6 High         , 4
cytoplasm & nucleus & plasma membrane,7 Very high    , 1"""

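# Create the dataframe
df = pd.read_csv(StringIO(data))
# str.strip() and astype() return new Series, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each row's share of its Localization total, as a percentage (0-100)
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())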
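The same idea can probably be applied directly to the result of .groupby().size(), without going through CSV text first. The following is a minimal sketch under stated assumptions: the `raw` DataFrame and its rows are made up for illustration, and the output of .size() is assumed to be a Series indexed by ('Localization', 'RNA level'), as in the question.

```python
import pandas as pd

# Hypothetical raw data (not from the question): one row per observation.
raw = pd.DataFrame({
    "Localization": ["cytoplasm"] * 4 + ["cytoplasm & nucleus"] * 2,
    "RNA level": [
        "1 Non-expressed", "2 Very low", "2 Very low", "2 Very low",
        "1 Non-expressed", "2 Very low",
    ],
})

# Same shape as in the question: a Series with a two-level MultiIndex.
counts = raw.groupby(["Localization", "RNA level"]).size()

# Total per Localization, broadcast back onto every row of that group.
totals = counts.groupby(level="Localization").transform("sum")

# Percentage of each RNA level within its Localization.
percent = 100 * counts / totals
print(percent)
```

Grouping on the index level keeps the MultiIndex intact, so no per-Localization DataFrames need to be created or merged afterwards.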
