Parsing Data From Long to Wide Format in Python


Question

I'm wondering what the best way is to parse long-form data into wide form in Python. I've previously been doing this sort of task in R, but it really is taking too long, as my files can be upwards of 1 GB. Here is some dummy data:

Sequence Position Strand Score
Gene1    0        +      1
Gene1    1        +      0.25
Gene1    0        -      1
Gene1    1        -      0.5
Gene2    0        +      0
Gene2    1        +      0.1
Gene2    0        -      0
Gene2    1        -      0.5

But I'd like to have it in wide form, with the scores summed over the strands at each position. Here is the output I'm hoping for:

Sequence 0 1
Gene1    2 0.75
Gene2    0 0.6

Any advice on how to approach such a problem conceptually would be really helpful.

Recommended answer

Both of the other solutions seem like overkill when you can do it with pandas in a one-liner:

In [7]: df
Out[7]: 
  Sequence  Position Strand  Score
0    Gene1         0      +   1.00
1    Gene1         1      +   0.25
2    Gene1         0      -   1.00
3    Gene1         1      -   0.50
4    Gene2         0      +   0.00
5    Gene2         1      +   0.10
6    Gene2         0      -   0.00
7    Gene2         1      -   0.50

In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]: 
Position  0     1
Sequence         
Gene1     2  0.75
Gene2     0  0.60
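
For anyone who wants to reproduce the session above, here is a minimal self-contained sketch. It assumes the data is whitespace-delimited (as the dummy data appears to be) and reads it from an in-memory string instead of a file; the variable names `raw`, `wide`, and `wide_pivot` are just illustrative. It also shows `pivot_table` as an equivalent spelling of the same groupby/unstack step.

```python
import io

import pandas as pd

# The dummy data from the question, assumed to be whitespace-delimited.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""

# sep=r"\s+" splits on any run of whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Sum Score over both strands for each (Sequence, Position) pair,
# then move Position from the row index into the columns.
wide = df.groupby(["Sequence", "Position"])["Score"].sum().unstack("Position")

# The same reshaping expressed as a pivot table.
wide_pivot = df.pivot_table(
    index="Sequence", columns="Position", values="Score", aggfunc="sum"
)

print(wide)
```

Both `wide` and `wide_pivot` should hold one row per sequence and one column per position, matching the output shown in the session.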

If you cannot load the file into memory, then an out-of-core solution like those in the other answers will work too.
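
The other answers are not reproduced here, so the following is only one possible out-of-core sketch, not their method. It stays within pandas and assumes a whitespace-delimited file with the same four columns: the file is read in chunks with `read_csv(..., chunksize=...)`, each chunk is reduced to per-(Sequence, Position) sums, and the partial sums are added together. The function name `long_to_wide_chunked` and the default chunk size are hypothetical, and the approach only helps if the aggregated result (not the raw file) fits in memory.

```python
import io

import pandas as pd


def long_to_wide_chunked(source, chunksize=100_000):
    """Sum Score per (Sequence, Position) chunk by chunk, then pivot to wide form.

    `source` may be a file path or a file-like object (hypothetical helper).
    """
    partial = None
    for chunk in pd.read_csv(source, sep=r"\s+", chunksize=chunksize):
        sums = chunk.groupby(["Sequence", "Position"])["Score"].sum()
        # Align on the (Sequence, Position) index; keys missing on one side count as 0.
        partial = sums if partial is None else partial.add(sums, fill_value=0)
    if partial is None:
        raise ValueError("no data rows found")
    return partial.unstack("Position")


# Usage on the dummy data, with a tiny chunk size to force several chunks.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""
wide_chunked = long_to_wide_chunked(io.StringIO(raw), chunksize=3)
print(wide_chunked)
```

Because addition is associative, the chunk boundaries should not affect the totals: a gene whose rows are split across chunks still ends up with a single summed row.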
