Pandas: .groupby().size() and percentages

Views: 155 | Published: 2020/5/24 0:18:32 | Tags: python, pandas, bioinformatics

Question

I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

Localization                           RNA level      
cytoplasm                              1 Non-expressed     7
                                       2 Very low         13
                                       3 Low               8
                                       4 Medium            6
                                       5 Moderate          8
                                       6 High              2
                                       7 Very high         6
cytoplasm & nucleus                    1 Non-expressed     5
                                       2 Very low          8
                                       3 Low               2
                                       4 Medium           10
                                       5 Moderate         16
                                       6 High              6
                                       7 Very high         5
cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                       2 Very low          3
                                       3 Low               3
                                       4 Medium            7
                                       5 Moderate          8
                                       6 High              4
                                       7 Very high         1

What I want to do is to calculate the separate occurrences (i.e. the last column coming from .size()) as a percentage of the total number of occurrences in the applicable Localization.

For example: there are a total of 50 occurrences in the cytoplasm localisation (7 + 13 + 8 + 6 + 8 + 2 + 6), yielding 14% and 26% for the Non-expressed and Very low RNA levels, respectively.

Is there a nice way of doing this? I've been going about it in what I think is a very roundabout way, i.e. making a new DataFrame for every Localization and working from there, but that takes a lot of lines and leaves the problem of having to merge all the resulting DataFrames at the end. I'm hoping there's a smarter way of doing it, at least!

Recommended answer

Here is a complete example based on the pandas groupby and sum functions. The basic idea is to group the data by 'Localization' and apply a function to each group with transform.

import pandas as pd
from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)

data = \
"""Localization,RNA level,Size
cytoplasm                            ,1 Non-expressed, 7
cytoplasm                            ,2 Very low     ,13
cytoplasm                            ,3 Low          , 8
cytoplasm                            ,4 Medium       , 6
cytoplasm                            ,5 Moderate     , 8
cytoplasm                            ,6 High         , 2
cytoplasm                            ,7 Very high    , 6
cytoplasm & nucleus                  ,1 Non-expressed, 5
cytoplasm & nucleus                  ,2 Very low     , 8
cytoplasm & nucleus                  ,3 Low          , 2
cytoplasm & nucleus                  ,4 Medium       ,10
cytoplasm & nucleus                  ,5 Moderate     ,16
cytoplasm & nucleus                  ,6 High         , 6
cytoplasm & nucleus                  ,7 Very high    , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low     , 3
cytoplasm & nucleus & plasma membrane,3 Low          , 3
cytoplasm & nucleus & plasma membrane,4 Medium       , 7
cytoplasm & nucleus & plasma membrane,5 Moderate     , 8
cytoplasm & nucleus & plasma membrane,6 High         , 4
cytoplasm & nucleus & plasma membrane,7 Very high    , 1"""

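# Create the DataFrame
df = pd.read_csv(StringIO(data))
# .str.strip() and .astype() return new objects, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each row's Size as a percentage of the total for its Localization
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())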
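
The question starts from the Series that .groupby().size() returns, so the same idea can probably be applied without rebuilding a flat DataFrame first. The following is a minimal sketch under that assumption: the `raw` frame and its values are made up for illustration and are not the asker's data, and `sizes`, `totals` and `percent` are hypothetical names. It groups the size Series on its 'Localization' index level and divides each count by its group total.

```python
import pandas as pd

# Hypothetical raw data (not from the question): one row per observation,
# so that .size() produces counts per (Localization, RNA level) pair.
raw = pd.DataFrame({
    "Localization": ["cytoplasm"] * 4 + ["nucleus"] * 2,
    "RNA level": ["1 Non-expressed", "2 Very low", "2 Very low", "2 Very low",
                  "1 Non-expressed", "2 Very low"],
})

# Series with a two-level MultiIndex, shaped like the output in the question.
sizes = raw.groupby(["Localization", "RNA level"]).size()

# Total count per Localization, broadcast back to every row of that group.
totals = sizes.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total.
percent = 100 * sizes / totals
print(percent)
```

Because transform returns a result aligned with the original index, the division lines up row by row and no merging of per-group DataFrames is needed.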
