How to split string from column to create long format dataframe


Problem description

If I have the dataframe shown below, how do I make a long format dataframe (i.e. one term per gene per row)?

I guess I will have to apply or map a split(",") to the Terms column, but what do I do after that?

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"

# Whitespace-separated table: one gene per row, comma-separated GO terms
df = pd.read_table(StringIO("""Gene    Terms
Mt-nd1  GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd    GO:0016021,GO:0045202,GO:0005886
Zmiz1   GO:0005654,GO:0043231
Cdca7   GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")

P.S. The table above is simplified; the actual df will have many more columns.

P.P.S. In case I was unclear, I want to end up with something like:

Mt-nd1  GO:0005739
Mt-nd1  GO:0005743
Mt-nd1  GO:0016021
...
Cdca7   GO:0005634
Cdca7   GO:0005654

Solution

You can use str.split to do the splitting (an alternative to the apply-and-split approach, but similar in effect):

In [6]: splitted = df['Terms'].str.split(',', expand=True)

In [7]: splitted 
Out[7]:
            0           1           2           3           4           5  \
0  GO:0005739  GO:0005743  GO:0016021  GO:0030425  GO:0043025  GO:0070469
1  GO:0016021  GO:0045202  GO:0005886         NaN         NaN         NaN
2  GO:0005654  GO:0043231         NaN         NaN         NaN         NaN
3  GO:0005622  GO:0005623  GO:0005737  GO:0005634  GO:0005654         NaN

            6           7           8
0  GO:0005623  GO:0005622  GO:0005737
1         NaN         NaN         NaN
2         NaN         NaN         NaN
3         NaN         NaN         NaN

The expand=True keyword makes split return separate columns instead of a single column of lists. For older pandas versions, you can use df['Terms'].str.split(',').apply(pd.Series) to obtain the same result.
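To make the difference concrete, here is a small sketch on a two-row frame (a shortened copy of the question's data, built by hand for illustration) that compares the plain split with the expanded one:

```python
import pandas as pd

# Shortened, hand-built copy of two rows from the question's table
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# Without expand: still one column, each cell holds the pieces of that row
as_lists = df["Terms"].str.split(",")

# With expand=True: a DataFrame with one column per piece,
# padded with missing values where a row has fewer pieces
as_columns = df["Terms"].str.split(",", expand=True)

print(as_lists)
print(as_columns)
```

The padded cells in the expanded frame are what later has to be discarded when the columns are stacked.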

Now, to obtain your desired output, we have to stack these columns, but first combine them with the Gene column so that this information is kept in the stacked frame:

In [14]: stacked = pd.concat([df['Gene'], splitted],axis=1).set_index('Gene').stack()
In [15]: stacked
Out[15]:
Gene
Mt-nd1  0    GO:0005739
        1    GO:0005743
        2    GO:0016021
        3    GO:0030425
        4    GO:0043025
        5    GO:0070469
        6    GO:0005623
        7    GO:0005622
        8    GO:0005737
Madd    0    GO:0016021
        1    GO:0045202
        2    GO:0005886
Zmiz1   0    GO:0005654
        1    GO:0043231
Cdca7   0    GO:0005622
        1    GO:0005623
        2    GO:0005737
        3    GO:0005634
        4    GO:0005654
dtype: object

From here, we can reset the index, rename the column holding the terms, and drop the integer column (which comes from the automatically generated column names) that we no longer need:

In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
Out[19]:
      Gene        Term
0   Mt-nd1  GO:0005739
1   Mt-nd1  GO:0005743
2   Mt-nd1  GO:0016021
3   Mt-nd1  GO:0030425
4   Mt-nd1  GO:0043025
5   Mt-nd1  GO:0070469
6   Mt-nd1  GO:0005623
7   Mt-nd1  GO:0005622
8   Mt-nd1  GO:0005737
9     Madd  GO:0016021
10    Madd  GO:0045202
11    Madd  GO:0005886
12   Zmiz1  GO:0005654
13   Zmiz1  GO:0043231
14   Cdca7  GO:0005622
15   Cdca7  GO:0005623
16   Cdca7  GO:0005737
17   Cdca7  GO:0005634
18   Cdca7  GO:0005654
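
For reference, the steps above can be put together into one script. This is a minimal sketch assuming Python 3 and a reasonably recent pandas; the input is shortened to two genes, and long_df is a name chosen here for illustration. The explicit dropna() is an addition that is not in the transcript above: some newer pandas releases may keep the NaN padding when stacking, and dropping it explicitly is harmless on versions that already discard it.

```python
from io import StringIO

import pandas as pd

# Shortened version of the question's input
raw = """Gene Terms
Mt-nd1 GO:0005739,GO:0005743
Madd GO:0016021
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# 1. Split the comma-separated terms into separate columns
splitted = df["Terms"].str.split(",", expand=True)

# 2. Attach the gene names, index by gene, and stack columns into rows
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()  # assumption: guard against NaN padding surviving the stack
)

# 3. Back to a flat frame with one (Gene, Term) pair per row
long_df = (
    stacked.reset_index()
    .rename(columns={0: "Term"})
    .drop("level_1", axis=1)
)

print(long_df)
```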

How this is combined or merged with your other columns will depend on exactly what you want to do with it.
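
If the goal is simply to carry every other column along, one option that is not part of the answer above is DataFrame.explode, available in pandas 0.25 and later. The sketch below assumes such a version; the Extra column and its values are made up to stand in for the additional columns mentioned in the question.

```python
import pandas as pd

# Hand-built frame; "Extra" is a hypothetical stand-in for the other columns
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Extra": ["a", "b"],
    "Terms": ["GO:0016021,GO:0045202", "GO:0005654"],
})

long_df = (
    df.assign(Term=df["Terms"].str.split(","))  # list of terms per row
    .drop(columns="Terms")
    .explode("Term")            # one row per term, other columns repeated
    .reset_index(drop=True)
)

print(long_df)
```

Each value in the other columns is repeated once per term, so no separate merge step is needed in this case.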
