How to spread a column in a Pandas data frame


Problem description

I have the following pandas data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame({
               'fc': [100,100,112,1.3,14,125],
               'sample_id': ['S1','S1','S1','S2','S2','S2'],
               'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
               })

df = df[['gene_symbol', 'sample_id', 'fc']]
df

Which produces this:

Out[11]:
  gene_symbol sample_id     fc
0           a        S1  100.0
1           b        S1  100.0
2           c        S1  112.0
3           a        S2    1.3
4           b        S2   14.0
5           c        S2  125.0

How can I spread sample_id into columns so that I end up with this:

gene_symbol    S1   S2
a             100   1.3
b             100   14.0
c             112   125.0

Solution

Use pivot or unstack:

# df = df[['gene_symbol', 'sample_id', 'fc']]
df = df.pivot(index='gene_symbol', columns='sample_id', values='fc')
print(df)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  125.0


# Alternative to pivot: run this on the original long-format df, not on the pivoted result
df = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)
print(df)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  125.0
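
As a self-contained check, the two approaches above can be run side by side on the data from the question. This is only a minimal sketch; the names long_df, wide_pivot and wide_unstack are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Long-format data from the question.
long_df = pd.DataFrame({
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'fc': [100, 100, 112, 1.3, 14, 125],
})

# Approach 1: pivot spreads sample_id into columns.
wide_pivot = long_df.pivot(index='gene_symbol', columns='sample_id', values='fc')

# Approach 2: build a two-level index, then unstack the inner level.
# fill_value=0 only matters when some gene/sample pair is missing.
wide_unstack = long_df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```

Keeping the long-format frame under its own name means both approaches can be applied to the same input without one overwriting the other.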

If the same gene_symbol/sample_id pair occurs more than once, pivot fails. In that case use pivot_table, or aggregate with groupby and then unstack. The mean can be replaced with sum, median, and so on:

df = pd.DataFrame({
               'fc': [100,100,112,1.3,14,125, 100],
               'sample_id': ['S1','S1','S1','S2','S2','S2', 'S2'],
               'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
               })
print (df)
      fc gene_symbol sample_id
0  100.0           a        S1
1  100.0           b        S1
2  112.0           c        S1
3    1.3           a        S2
4   14.0           b        S2
5  125.0           c        S2 <- same c, S2, different fc
6  100.0           c        S2 <- same c, S2, different fc

df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')

ValueError: Index contains duplicate entries, cannot reshape
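
Before choosing an aggregation, it can help to see which rows cause the error. The following is a rough sketch, assuming the same data as above; dup_df, keys, dup_rows and pivot_failed are hypothetical names introduced for illustration.

```python
import pandas as pd

# Data with a repeated (gene_symbol, sample_id) pair: ('c', 'S2').
dup_df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

keys = ['gene_symbol', 'sample_id']

# keep=False flags every row of a repeated key pair, not only the later ones.
dup_rows = dup_df[dup_df.duplicated(subset=keys, keep=False)]
print(dup_rows)

# pivot refuses to reshape when a key pair is repeated.
try:
    dup_df.pivot(index='gene_symbol', columns='sample_id', values='fc')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print(exc)
```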

df = df.pivot_table(index='gene_symbol', columns='sample_id', values='fc', aggfunc='mean')
print(df)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  112.5


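# Alternative to pivot_table: run this on the long-format df with duplicates, not on the pivoted result
df = df.groupby(['gene_symbol', 'sample_id'])['fc'].mean().unstack(fill_value=0)
print(df)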
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  112.5
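
To illustrate swapping the aggregation, here is a small sketch that uses sum with pivot_table and median with groupby on the same duplicated data. The variable names dup_df, wide_sum and wide_median are made up for this example.

```python
import pandas as pd

# Same data as above, with ('c', 'S2') appearing twice.
dup_df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# pivot_table with a different aggregation function.
wide_sum = dup_df.pivot_table(index='gene_symbol', columns='sample_id',
                              values='fc', aggfunc='sum')

# groupby + unstack with yet another aggregation.
wide_median = (dup_df.groupby(['gene_symbol', 'sample_id'])['fc']
               .median()
               .unstack(fill_value=0))

print(wide_sum)
print(wide_median)
```

Which aggregation is appropriate depends on what the repeated measurements mean; the code only shows that the function is interchangeable.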

EDIT:

To clean up the result, set the columns name to None and call reset_index:

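df.columns.name = None   # remove the 'sample_id' label above the column headers
df = df.reset_index()    # move gene_symbol from the index back into a regular column
print(df)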
  gene_symbol     S1     S2
0           a  100.0    1.3
1           b  100.0   14.0
2           c  112.0  112.5
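
The reshape and the cleanup can also be written as one chain. This is a sketch of one possible variant, not the original answer's code: it swaps the columns.name assignment for rename_axis(None, axis=1), and long_df and tidy are illustrative names.

```python
import pandas as pd

# Long-format data from the question (no duplicates, so pivot is enough).
long_df = pd.DataFrame({
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'fc': [100, 100, 112, 1.3, 14, 125],
})

tidy = (
    long_df.pivot(index='gene_symbol', columns='sample_id', values='fc')
    .rename_axis(None, axis=1)  # drop the 'sample_id' name on the columns axis
    .reset_index()              # turn gene_symbol back into a regular column
)
print(tidy)
```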
