How to split string from column to create long format dataframe
Problem description
If I have the dataframe shown below, how do I make a long-format dataframe (i.e. one term per gene per row)?

I guess I will have to apply or map a split(",") to the Terms column, but what do I do after that?

    import pandas as pd
    from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

    df = pd.read_table(StringIO("""Gene Terms
    Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
    Madd GO:0016021,GO:0045202,GO:0005886
    Zmiz1 GO:0005654,GO:0043231
    Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")
P.S. The table above is simplified; the actual df will have many more columns.

P.P.S. In case I was unclear, I want to end up with something like:

    Mt-nd1 GO:0005739
    Mt-nd1 GO:0005743
    Mt-nd1 GO:0016021
    ...
    Cdca7 GO:0005634
    Cdca7 GO:0005654
Solution

You can use str.split to do the splitting (instead of the apply-and-split approach, but similar):

    In [6]: splitted = df['Terms'].str.split(',', expand=True)

    In [7]: splitted
    Out[7]:
                0           1           2           3           4           5  \
    0  GO:0005739  GO:0005743  GO:0016021  GO:0030425  GO:0043025  GO:0070469
    1  GO:0016021  GO:0045202  GO:0005886         NaN         NaN         NaN
    2  GO:0005654  GO:0043231         NaN         NaN         NaN         NaN
    3  GO:0005622  GO:0005623  GO:0005737  GO:0005634  GO:0005654         NaN

                6           7           8
    0  GO:0005623  GO:0005622  GO:0005737
    1         NaN         NaN         NaN
    2         NaN         NaN         NaN
    3         NaN         NaN         NaN
To turn it into columns (instead of a list), you can use the expand=True keyword to split, or for older pandas versions you can do df['Terms'].str.split(',').apply(pd.Series) to obtain the same.

Now, to obtain your desired output we have to stack these columns, but first merge them with the Gene column so that this information is in the stacked frame:
    In [14]: stacked = pd.concat([df['Gene'], splitted], axis=1).set_index('Gene').stack()

    In [15]: stacked
    Out[15]:
    Gene
    Mt-nd1  0    GO:0005739
            1    GO:0005743
            2    GO:0016021
            3    GO:0030425
            4    GO:0043025
            5    GO:0070469
            6    GO:0005623
            7    GO:0005622
            8    GO:0005737
    Madd    0    GO:0016021
            1    GO:0045202
            2    GO:0005886
    Zmiz1   0    GO:0005654
            1    GO:0043231
    Cdca7   0    GO:0005622
            1    GO:0005623
            2    GO:0005737
            3    GO:0005634
            4    GO:0005654
    dtype: object
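To make the stacking step easier to follow in isolation, here is a minimal sketch on a two-gene subset of the data above. The names `wide` and `toy_stacked` are illustrative and not part of the original answer. The sketch calls `dropna()` explicitly rather than relying on `stack()` to discard the padding values.

```python
import pandas as pd

# Wide frame for two genes from the question; Zmiz1 has only two terms,
# so its third column holds a missing value.
wide = pd.DataFrame(
    {
        0: ["GO:0016021", "GO:0005654"],
        1: ["GO:0045202", "GO:0043231"],
        2: ["GO:0005886", None],
    },
    index=pd.Index(["Madd", "Zmiz1"], name="Gene"),
)

# stack() moves the column labels (0, 1, 2) into an inner index level,
# giving one value per (Gene, position) pair; dropna() removes the padding.
toy_stacked = wide.stack().dropna()
print(toy_stacked)
```

The result is a Series indexed by gene and by the position of the term within the original string, which is why an extra integer level has to be dropped afterwards.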
From here, we can reset the index, rename our column with terms, and drop the integer column (from the automatically generated column names) we don't need anymore:
    In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
    Out[19]:
          Gene        Term
    0   Mt-nd1  GO:0005739
    1   Mt-nd1  GO:0005743
    2   Mt-nd1  GO:0016021
    3   Mt-nd1  GO:0030425
    4   Mt-nd1  GO:0043025
    5   Mt-nd1  GO:0070469
    6   Mt-nd1  GO:0005623
    7   Mt-nd1  GO:0005622
    8   Mt-nd1  GO:0005737
    9     Madd  GO:0016021
    10    Madd  GO:0045202
    11    Madd  GO:0005886
    12   Zmiz1  GO:0005654
    13   Zmiz1  GO:0043231
    14   Cdca7  GO:0005622
    15   Cdca7  GO:0005623
    16   Cdca7  GO:0005737
    17   Cdca7  GO:0005634
    18   Cdca7  GO:0005654
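For convenience, the steps of this answer can be collected into one script. This is a sketch for Python 3, assuming a reasonably recent pandas; it uses `io.StringIO` and `pd.read_csv`, and the variable names `data` and `long_df` are introduced here for illustration. As an extra precaution that is not in the original answer, it calls `dropna()` after `stack()` so the padding values are removed explicitly.

```python
from io import StringIO

import pandas as pd

# Sample data from the question.
data = """Gene Terms
Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd GO:0016021,GO:0045202,GO:0005886
Zmiz1 GO:0005654,GO:0043231
Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""
df = pd.read_csv(StringIO(data), sep=r"\s+")

# Step 1: one column per term; shorter rows are padded with missing values.
splitted = df["Terms"].str.split(",", expand=True)

# Step 2: attach the gene names, stack the term columns into rows,
# and drop the padding values explicitly.
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()
)

# Step 3: back to a flat frame with one (Gene, Term) pair per row.
long_df = (
    stacked.reset_index()
    .rename(columns={0: "Term"})
    .drop("level_1", axis=1)
)
print(long_df)
```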
How this can be combined or merged with your other columns will depend on what exactly you want to do with it.
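If the goal is simply to repeat every other column alongside each term, one alternative is `DataFrame.explode`, available since pandas 0.25. This is a different technique from the stack-based approach above, not part of the original answer. The sketch below assumes a made-up extra column `Note` with placeholder values, standing in for the additional columns mentioned in the question.

```python
import pandas as pd

# Two genes from the question plus a hypothetical extra column ("Note");
# its values are placeholders, not real annotations.
df = pd.DataFrame(
    {
        "Gene": ["Madd", "Zmiz1"],
        "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
        "Note": ["a", "b"],
    }
)

long_df = (
    df.assign(Term=df["Terms"].str.split(","))  # a list of terms per row
    .explode("Term")                            # one row per list element
    .drop(columns="Terms")                      # the original string is no longer needed
    .reset_index(drop=True)
)
print(long_df)
```

Each value in `Note` is carried along with every term of its gene, so no separate merge step is needed in this case.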