Quantile normalization on a pandas DataFrame

Question

Simply put, how can I apply quantile normalization to a large pandas DataFrame (possibly 2,000,000 rows) in Python?

PS. I know there is a package named rpy2 that can run R from Python, so I could use the quantile normalization available in R. The problem is that R does not compute the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how can I apply quantile normalization following the steps described at https://en.wikipedia.org/wiki/Quantile_normalization?

I found a piece of Python code that claims to compute quantile normalization:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# Requires R with the Bioconductor package preprocessCore installed.
preprocessCore = importr('preprocessCore')

# Each inner list is one column of the matrix.
matrix = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
# Flatten column by column, then rebuild the matrix in R in column-major order.
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)

The code works fine with the sample data it contains; however, when I tested it with the data given above, the result was wrong.

Since rpy2 only provides an interface for running R from Python, I tested the same data directly in R, and the result was still wrong. I therefore think the problem lies in the R method itself.
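One property of the data quoted above that may be worth checking when comparing implementations is that it contains tied values: the first and third rows are identical, and the last column repeats one value three times. The Wikipedia procedure assigns tied values the mean of the reference values they would otherwise receive, so an implementation that treats ties differently can give a different answer. The snippet below is only a sketch for loading the four rows and counting ties per column; the names `RAW` and `tied_per_column` are illustrative and not part of the original post, and the check does not by itself show which result is correct.

```python
import io

import pandas as pd

# The four rows quoted in the question, copied verbatim.
RAW = """\
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05
"""

df = pd.read_csv(io.StringIO(RAW), header=None)

# For each column, count the rows whose value also occurs in another row.
tied_per_column = df.apply(lambda col: int(col.duplicated(keep=False).sum()))
print(tied_per_column.tolist())  # [2, 2, 2, 2, 2, 3]
```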

Solution

OK, I implemented the method myself, and it is relatively efficient.

In hindsight the logic seems fairly simple, but I decided to post it here anyway for anyone who is as confused as I was when I could not find usable code on Google.

The code is on GitHub: Quantile Normalize
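The linked code is not reproduced on this page, so the following is not the author's implementation. It is a minimal sketch of the steps described on the Wikipedia page, written with pandas and NumPy under a few assumptions: every column is numeric, there are no missing values, and column labels are unique. The function name `quantile_normalize` and the example frame are hypothetical.

```python
import numpy as np
import pandas as pd


def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Quantile-normalize the columns of df (sketch; assumes no NaN values)."""
    values = df.to_numpy(dtype=float)
    # Sort every column independently, then average across columns:
    # reference[i] is the mean of the i-th smallest values of all columns.
    reference = np.sort(values, axis=0).mean(axis=1)

    result = {}
    for col in df.columns:
        # Ordinal ranks 0..n-1; ties are broken by order of appearance.
        ordinal = df[col].rank(method="first").astype(int).to_numpy() - 1
        assigned = pd.Series(reference[ordinal])
        # Tied original values share the mean of the reference values they got.
        tied_mean = assigned.groupby(df[col].to_numpy()).transform("mean")
        result[col] = tied_mean.to_numpy()
    return pd.DataFrame(result, index=df.index, columns=df.columns)


# Small example with one tie (the two 4s in column "s2").
example = pd.DataFrame({"s1": [5, 2, 3, 4], "s2": [4, 1, 4, 2], "s3": [3, 4, 6, 8]})
normalized = quantile_normalize(example)
# Rounded to two decimals, the columns are:
#   s1: 5.67, 2.00, 3.00, 4.67
#   s2: 5.17, 2.00, 5.17, 3.00
#   s3: 2.00, 3.00, 4.67, 5.67
print(normalized.round(2))
```

The per-column loop keeps the tie handling explicit. Each column costs one sort-based ranking and one group-by, so a frame with millions of rows and a modest number of columns should remain practical, although this has not been benchmarked here.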
