How to split string from column to create long format dataframe

Views: 121 | Published: 2017/3/26 2:07:57 | Tags: python, pandas, dataframe


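Question

If I have the dataframe shown below, how do I make a long-format dataframe (i.e. one term per gene per row)?

I guess I will have to apply or map a split(",") to the Terms column, but what do I do after that?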

    import pandas as pd
    from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

    # Whitespace-separated table: one gene per row, comma-separated GO terms
    df = pd.read_table(StringIO("""Gene Terms
    Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
    Madd GO:0016021,GO:0045202,GO:0005886
    Zmiz1 GO:0005654,GO:0043231
    Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")

P.S. The table above is simplified; the actual df will have many more columns.

P.P.S. In case I was unclear, I want to end up with something like:

    Mt-nd1 GO:0005739
    Mt-nd1 GO:0005743
    Mt-nd1 GO:0016021
    ...
    Cdca7 GO:0005634
    Cdca7 GO:0005654

Solution

You can use str.split to do the splitting (instead of the apply-and-split approach, but similar):

    In [6]: splitted = df['Terms'].str.split(',', expand=True)

    In [7]: splitted
    Out[7]:
                0           1           2           3           4           5  \
    0  GO:0005739  GO:0005743  GO:0016021  GO:0030425  GO:0043025  GO:0070469
    1  GO:0016021  GO:0045202  GO:0005886         NaN         NaN         NaN
    2  GO:0005654  GO:0043231         NaN         NaN         NaN         NaN
    3  GO:0005622  GO:0005623  GO:0005737  GO:0005634  GO:0005654         NaN

                6           7           8
    0  GO:0005623  GO:0005622  GO:0005737
    1         NaN         NaN         NaN
    2         NaN         NaN         NaN
    3         NaN         NaN         NaN

To get separate columns (instead of a single column of lists), pass the expand=True keyword to split; for older pandas versions, df['Terms'].str.split(',').apply(pd.Series) gives the same result.
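As a minimal sketch of that equivalence, the snippet below runs both forms on a two-row frame that mirrors the question's layout (the frame is rebuilt by hand here, which is an assumption made for brevity) and compares where the padding falls.

```python
import pandas as pd

# Small frame mirroring the question's layout (two of its rows)
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# One step: split and expand straight into columns
expanded = df["Terms"].str.split(",", expand=True)

# Two steps: split into lists, then turn each list into a row of columns
via_apply = df["Terms"].str.split(",").apply(pd.Series)

# Both results have one column per term position; shorter rows are padded
# with missing values in the same cells
print(expanded.shape, via_apply.shape)
print(expanded.isna().to_numpy().tolist() == via_apply.isna().to_numpy().tolist())
```

The apply(pd.Series) form builds one Series per row, so it tends to be slower on large frames; expand=True is usually the more convenient choice when it is available.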

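Now, to obtain your desired output we have to stack these columns, but first merge them with the Gene column so that this information is available in the stacked frame: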

    In [14]: stacked = pd.concat([df['Gene'], splitted], axis=1).set_index('Gene').stack()

    In [15]: stacked
    Out[15]:
    Gene
    Mt-nd1  0    GO:0005739
            1    GO:0005743
            2    GO:0016021
            3    GO:0030425
            4    GO:0043025
            5    GO:0070469
            6    GO:0005623
            7    GO:0005622
            8    GO:0005737
    Madd    0    GO:0016021
            1    GO:0045202
            2    GO:0005886
    Zmiz1   0    GO:0005654
            1    GO:0043231
    Cdca7   0    GO:0005622
            1    GO:0005623
            2    GO:0005737
            3    GO:0005634
            4    GO:0005654
    dtype: object
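The output above contains no padding rows because stack() dropped the missing values. Newer pandas releases may keep them instead, so the sketch below adds an explicit dropna(), which is harmless either way. The two-row frame is an assumption, rebuilt by hand from the question's data.

```python
import pandas as pd

# Small frame mirroring the question's layout (two of its rows)
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

splitted = df["Terms"].str.split(",", expand=True)

# Stack the term columns under each gene; dropna() removes any padding
# that stack() itself did not already discard
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()
)

# One entry per (gene, term) pair
print(len(stacked))
print(stacked.groupby(level="Gene").size())
```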

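From here, we can reset the index, rename the value column to Term, and drop the integer column (left over from the automatically generated column names) that we no longer need: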

    In [19]: stacked.reset_index().rename(columns={0: 'Term'}).drop('level_1', axis=1)
    Out[19]:
          Gene        Term
    0   Mt-nd1  GO:0005739
    1   Mt-nd1  GO:0005743
    2   Mt-nd1  GO:0016021
    3   Mt-nd1  GO:0030425
    4   Mt-nd1  GO:0043025
    5   Mt-nd1  GO:0070469
    6   Mt-nd1  GO:0005623
    7   Mt-nd1  GO:0005622
    8   Mt-nd1  GO:0005737
    9     Madd  GO:0016021
    10    Madd  GO:0045202
    11    Madd  GO:0005886
    12   Zmiz1  GO:0005654
    13   Zmiz1  GO:0043231
    14   Cdca7  GO:0005622
    15   Cdca7  GO:0005623
    16   Cdca7  GO:0005737
    17   Cdca7  GO:0005634
    18   Cdca7  GO:0005654
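A possible variation on this last step, shown as a sketch rather than as the answer's own method: name the stacked Series first and discard the integer level while resetting, so no rename or drop of 'level_1' is needed afterwards. The small frame is again an assumption rebuilt from the question's data.

```python
import pandas as pd

# Small frame mirroring the question's layout (two of its rows)
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

splitted = df["Terms"].str.split(",", expand=True)
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()
)

long_df = (
    stacked.rename("Term")            # name the values so the column is labelled
    .reset_index(level=1, drop=True)  # discard the integer position level
    .reset_index()                    # move Gene from the index into a column
)
print(long_df)
```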

How this can be combined or merged with the other columns you have will depend on what exactly you want to do with it.
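If the goal is simply to repeat every other column alongside each term, one option in pandas 0.25 or later is DataFrame.explode. This is a different technique from the stack approach above, sketched here under assumptions: the frame is rebuilt by hand, and the "Score" column is a made-up stand-in for the extra columns mentioned in the question.

```python
import pandas as pd

# "Score" is a hypothetical extra column; its values are arbitrary
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
    "Score": [1.5, 2.5],
})

long_df = (
    df.assign(Term=df["Terms"].str.split(","))  # list of terms per row
    .drop(columns="Terms")                      # the raw string is no longer needed
    .explode("Term")                            # one row per list element
    .reset_index(drop=True)                     # replace the repeated original index
)
print(long_df)
```

Each remaining column is carried along unchanged, so no separate merge step is needed in this case.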



                    