Parsing Data From Long to Wide Format in Python
Question
I'm wondering what the best way is to parse long-format data into wide format in Python. I've previously been doing this sort of task in R, but it really is taking too long, as my files can be upwards of 1 GB. Here is some dummy data:
Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
But I'd like to have it in wide form, with the scores summed over the strands at each position. Here is the output I'm hoping for:
Sequence 0 1
Gene1 2 0.75
Gene2 0 0.6
Any help on how to attack such a problem conceptually would be really helpful.
Recommended Answer
Both of the other solutions seem like overkill when you can do it with pandas in a one-liner:
In [7]: df
Out[7]:
Sequence Position Strand Score
0 Gene1 0 + 1.00
1 Gene1 1 + 0.25
2 Gene1 0 - 1.00
3 Gene1 1 - 0.50
4 Gene2 0 + 0.00
5 Gene2 1 + 0.10
6 Gene2 0 - 0.00
7 Gene2 1 - 0.50
In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]:
Position 0 1
Sequence
Gene1 2 0.75
Gene2 0 0.60
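For anyone who wants to reproduce the session above, here is a minimal self-contained sketch. It assumes the dummy data is whitespace-delimited text; the `raw` string and the variable names `wide` and `wide_pivot` are illustrative and not part of the original post. It also shows `pivot_table`, which should give the same result in a single call.

```python
import io

import pandas as pd

# Dummy data from the question, assumed here to be whitespace-delimited.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Sum scores over strands for each (Sequence, Position) pair,
# then move the Position level of the index into columns.
wide = df.groupby(["Sequence", "Position"])["Score"].sum().unstack("Position")

# Equivalent formulation: pivot_table aggregates and reshapes in one step.
wide_pivot = df.pivot_table(
    index="Sequence", columns="Position", values="Score", aggfunc="sum"
)

print(wide)
```

The `groupby` form makes the two steps (aggregate, then reshape) explicit, while `pivot_table` is more compact; which one reads better is largely a matter of taste.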
If you cannot load the file into memory, then one of the out-of-core solutions in the other answers will work too.
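The other answers are not reproduced here, so the following is not their code. It is only one possible sketch of an out-of-core approach that stays within pandas: read the file in chunks with the `chunksize` option of `read_csv`, aggregate each chunk, and accumulate the partial sums. The function name `long_to_wide_chunked`, the default chunk size, and the whitespace delimiter are assumptions made for illustration.

```python
import io

import pandas as pd


def long_to_wide_chunked(source, chunksize=100_000):
    """Sum Score per (Sequence, Position) chunk by chunk, then pivot to wide.

    Hypothetical helper: assumes a whitespace-delimited file with the
    columns Sequence, Position, Strand and Score, and at least one data row.
    """
    total = None
    for chunk in pd.read_csv(source, sep=r"\s+", chunksize=chunksize):
        partial = chunk.groupby(["Sequence", "Position"])["Score"].sum()
        # Align on the (Sequence, Position) index; keys missing on one side count as 0.
        total = partial if total is None else total.add(partial, fill_value=0)
    return total.unstack("Position")


# Small demonstration on the dummy data, with a deliberately tiny chunk size
# so that groups are split across chunks.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""

wide_chunked = long_to_wide_chunked(io.StringIO(raw), chunksize=3)
print(wide_chunked)
```

Only the running sums are held in memory, so this works as long as the aggregated result (one value per Sequence and Position) fits, even when the raw file does not.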