Comparing/extracting data from matrices using Python (2.6.1)

Question

I have two .csv files containing correlation matrices exported from R. One file contains the P-values and one contains the r-values. The row and column headers match exactly between the two files.

I am trying to extract the r-values and corresponding row and column header for pairs only when the P-value < 0.05. Here is a sample of what the data in the r-value input file looks like (I have 1700+ correlated items, rather than only the two shown):

            Species1                 Species2
Species1      1                       0.9
Species2      0.9                     1

The P-value input file is identical, except containing P-values in place of r-values.

I am relatively new to Python, and am not sure how to handle files of this type. I have tried a few strategies, including using the csv library to iterate through the files. I looked into using numpy, but it doesn't seem that it will work for me (?). I also looked into using scipy to calculate r- and P-values (Pearsons) in Python, but it seems that this only works for comparing two one dimensional arrays (I have 1700+ columns of data to correlate).

Code I am starting with, to show you what I have imported:

import csv
infileP = open('AllcorrP.csv', 'rU')
infileR = open('AllcorrR.csv', 'rU')

The question

Can anyone help me extract the column and row headers and r-values from my r-value file based on significant (< 0.05) P-values from my P-value file?

OR

Calculate the r- and P-values for all possible correlations between many columns of data directly using Python and extract only the results with significant P-values?

In the end, I would like output in two files.
First file:

Species1   Species2   Species4  ...
Species2   Species1   Species7  ...

etc. (where "Species1" is the first species with significant correlations, and the next items on the line are the species it is significantly correlated with: Species2, Species4, etc.)

Second file:

Species1 (corr) Species2 = 0.87
Species2 (corr) Species7 = 0.72
...

etc.

At this point, I'd be happy to just be able to extract a list of the r-values and species that I want and figure out the final two file formatting later. Thank you!

Recommended answer

To read the data, you should be able to use numpy.genfromtxt. See the documentation; there is a ton of functionality in this function. To read your example above, you might do:

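from numpy import genfromtxt
# delimiter=',' assumes the files are comma-separated, as .csv exports from R usually are
# skip_header=1 skips the row of column labels; [:,1:] drops the column of row labels
rdata = genfromtxt('AllcorrR.csv', delimiter=',', skip_header=1)[:,1:]
Pdata = genfromtxt('AllcorrP.csv', delimiter=',', skip_header=1)[:,1:]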

The [:,1:] slice drops the first column (the row labels) once the data has been read in. The function doesn't have an input to "ignore the first x columns" like it does for rows (via skip_header). Not sure why they didn't implement this; it always bugged me.
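As an alternative to slicing, genfromtxt's usecols argument lets you list the columns to keep, which has the same effect here. The following is only a sketch: the in-memory text stands in for AllcorrR.csv, and its comma-separated layout with an empty corner cell is an assumption about how the file was exported.

```python
import io
import numpy as np

# Stand-in for the contents of AllcorrR.csv (assumed layout, not the real file).
text = ',Species1,Species2\nSpecies1,1,0.9\nSpecies2,0.9,1\n'

# Reading everything: the text labels in column 0 cannot be parsed as floats
# and come back as nan.
full = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1)

# Option 1: slice the label column away after reading.
sliced = full[:, 1:]

# Option 2: ask genfromtxt for the numeric columns only.
picked = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1,
                       usecols=(1, 2))
```

With 1700+ columns you would build the usecols list programmatically, for example with range(1, n + 1) where n is the number of species.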

This reads only the numeric values of each matrix, which you can then filter pretty easily. To get the headings, you could read in the first row and first column separately. Or, if you look at the genfromtxt documentation, you could also name the columns (creating a recarray).
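One way to read the headings separately is with the csv module. This is a minimal sketch, assuming comma-separated files whose first row holds the column names after an empty corner cell. read_labels is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import csv
import io

def read_labels(handle):
    """Return (column_labels, row_labels) from a CSV correlation matrix."""
    rows = list(csv.reader(handle))
    col_labels = rows[0][1:]                          # skip the corner cell
    row_labels = [row[0] for row in rows[1:] if row]  # first cell of each data row
    return col_labels, row_labels

# Stand-in for open('AllcorrR.csv'); the layout is an assumption.
sample = io.StringIO(',Species1,Species2\nSpecies1,1,0.9\nSpecies2,0.9,1\n')
col_labels, row_labels = read_labels(sample)
```

Because the row and column headers match between the two files, the labels only need to be read from one of them.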

To find the indices (values) where P is less than 0.05, you can simply do a comparison, and numpy automagically creates a boolean array for you:

print(Pdata < 0.05)

This can be used as an index into rdata (make sure there are the same number of rows/columns):

print(rdata[Pdata < 0.05])
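Boolean indexing returns a flat array of r-values and loses track of which pair each value belongs to. To recover the row and column of every significant entry, one option is numpy.nonzero on the mask. The sketch below uses small made-up arrays and a hypothetical labels list in place of the real files, and it excludes the diagonal on the assumption that a species' correlation with itself is not of interest.

```python
import numpy as np

# Made-up illustration data; in practice these come from the two files.
labels = ['Species1', 'Species2', 'Species3']
rdata = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, -0.7],
                  [0.1, -0.7, 1.0]])
Pdata = np.array([[0.0, 0.01, 0.60],
                  [0.01, 0.0, 0.03],
                  [0.60, 0.03, 0.0]])

mask = Pdata < 0.05
np.fill_diagonal(mask, False)   # drop each species paired with itself

# Row and column positions of every significant pair, in row-major order.
rows, cols = np.nonzero(mask)
pairs = [(labels[i], labels[j], float(rdata[i, j])) for i, j in zip(rows, cols)]

# Second output format: one line per significant pair.
for a, b, r in pairs:
    print('%s (corr) %s = %.2f' % (a, b, r))

# First output format: each species followed by its significant partners.
partners = {}
for a, b, r in pairs:
    partners.setdefault(a, []).append(b)
```

Since a correlation matrix is symmetric, every pair appears twice (once in each direction). That suits the first output format; for the second you may want to keep only the entries where the column index is greater than the row index.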
