Combine two columns under one header in Numpy array


Problem description


I have two NumPy arrays that I need to combine, keeping only certain columns of A (size (888, 1114253)) depending on the rows I have in B (size (555861, 3)).

The problem is that the header of A is 55730: each column has two values!

In other words, I want only the columns of A whose header matches a row in B, but in A each column is "double".

An example will clarify:

A:

family id mum dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
     1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
     2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
     3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
     4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T 

Since each rsxxx column header in this file has two corresponding columns, I have to find a way to put them together so that I can read the file as an array.

B:

1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360

The desired output is

Output:

 family id mum dad  rs1 rs2 rs5 rs8 rs12
  1      1   4   6  A T A A G G A A C C
  2      2   7   9  T A G A G A A C C T
  3      3   2   8  T T G G G G A C C C
  4      4   5   1  A A A A G A A A C T

Ideas?

On the console

B:

array([['1', 'rs3094315', '752566'],
       ['1', 'rs12562034', '768448'],
       ['1', 'rs3934834', '1005806'],
       ..., 
       ['23', 'rs2032612', '21866491'],
       ['23', 'rs2032621', '21872738'],
       ['23', 'rs2032617', '21896261']], 
      dtype='<S10')

Solution

It looks like the columns are separated by two spaces, while the two values of each gene pair are separated by one space. If so, you can use

delimiter='  '   # two spaces

in np.genfromtxt:

import numpy as np
from StringIO import StringIO  # Python 2 module, used to fake a file; on Python 3 use: from io import StringIO

a = StringIO("""family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T """)


nrs = 12        # number of `rs` columns, for dtype
dt = 'int,'*4 + 'S10,'*nrs

A = np.genfromtxt(a, delimiter='  ', names=True, dtype=dt)

A:

array([ (1, 1, 4, 6, ' A T', 'A A', 'T T', 'C C', 'G G', 'A T', 'A G', 'A A', 'G A', 'T A', 'G G', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'C T', 'C T', 'G A', 'T T', 'A A', 'A C', 'G G', 'T A', 'C C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'C T', 'C T', 'G G', 'A T', 'A G', 'A C', 'G G', 'T T', 'C C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'T T', 'C C', 'G A', 'T T', 'A A', 'A A', 'G A', 'T A', 'G C', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs3', 'S10'), ('rs4', 'S10'), ('rs5', 'S10'), ('rs6', 'S10'), ('rs7', 'S10'), ('rs8', 'S10'), ('rs9', 'S10'), ('rs10', 'S10'), ('rs11', 'S10'), ('rs12', 'S10')])
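
The session above is Python 2 (the StringIO module, and plain 'A T' strings for an S10 dtype). A minimal Python 3 sketch of the same load follows, under a few assumptions that are not in the original answer: a shortened three-marker sample, a Unicode dtype (U3) so that values come back as str rather than bytes, and autostrip=True to drop the stray leading space seen in ' A T' above.

```python
import numpy as np
from io import StringIO  # Python 3 replacement for the StringIO module

# Shortened sample (assumption): 4 metadata columns + 3 rs columns,
# columns separated by two or three spaces, alleles by one space.
sample = """family  id  mum  dad  rs1  rs2  rs3
1  1   4   6   A T  A A  T T
2  2   7   9   T A  G A  C T
"""

nrs = 3  # number of rs columns, for the dtype
dt = ",".join(["i8"] * 4 + ["U3"] * nrs)  # e.g. "i8,i8,i8,i8,U3,U3,U3"

A = np.genfromtxt(
    StringIO(sample),
    delimiter="  ",   # two spaces
    names=True,       # take field names from the header line
    dtype=dt,
    autostrip=True,   # strip whitespace around each field
    encoding="utf-8",
)

print(A.dtype.names)
print(A["rs1"])
```

Each rs field then holds the whole pair as one string, so one header name maps to one field.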

Then to access only the columns from B, do something like this:

b = StringIO("""1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360""")

B = np.genfromtxt(b, usecols=[1], dtype='S10')

Now, use A[B]:

A[B]
array([(' A T', 'A A', 'G G', 'A A', 'C C'),
       (' T A', 'G A', 'G A', 'A C', 'C T'),
       (' T T', 'G G', 'G G', 'A C', 'C C'),
       (' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])
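
On current NumPy, indexing a structured array with an array of names, as in A[B], may raise an IndexError, because multi-field selection expects a plain list of str. The sketch below assumes Python 3 and Unicode field names, and builds a tiny stand-in for A by hand. It converts B to a list first and skips names that A does not have; that skip is an addition, not part of the original answer.

```python
import numpy as np
from io import StringIO

# Hand-built stand-in for A (assumption: already loaded, only three rs fields).
A = np.array(
    [(1, 1, 4, 6, "A T", "A A", "G G"),
     (2, 2, 7, 9, "T A", "G A", "G A")],
    dtype=[("family", "i8"), ("id", "i8"), ("mum", "i8"), ("dad", "i8"),
           ("rs1", "U3"), ("rs2", "U3"), ("rs5", "U3")],
)

b = StringIO("""1  rs1 2345
2  rs5 2348
4  rs8 2351
""")
# Read only the second column (the rs names) as text.
B = np.genfromtxt(b, usecols=[1], dtype="U10", encoding="utf-8")

# Keep only names that are actual fields of A, as a list of str.
available = set(A.dtype.names)
wanted = [name for name in B.ravel().tolist() if name in available]

subset = A[wanted]
print(subset.dtype.names)
print(subset["rs5"])
```

Here rs8 is listed in B but absent from the stand-in A, so it is dropped instead of causing an error.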

Or, if you want the first four columns too:

A[['family', 'id', 'mum', 'dad'] + list(B)]
array([(1, 1, 4, 6, ' A T', 'A A', 'G G', 'A A', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'G A', 'A C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'G G', 'A C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])
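
If the file turns out not to be reliably double-space delimited, the delimiter trick above will not apply. One alternative, offered as a sketch rather than as part of the original answer, is to split on any whitespace, join each pair of allele columns with np.char.add, and select markers with a boolean mask from np.isin. The sample data and variable names are illustrative assumptions.

```python
import numpy as np

header = "family id mum dad rs1 rs2 rs3".split()
lines = [
    "1 1 4 6 A T A A T T",
    "2 2 7 9 T A G A C T",
]
n_meta = 4  # family, id, mum, dad

# Split on any whitespace: every token becomes its own column.
raw = np.array([line.split() for line in lines])
meta = raw[:, :n_meta].astype(int)
alleles = raw[:, n_meta:]  # two columns per rs header

# Join columns (0, 1), (2, 3), ... into "X Y" strings, one per rs header.
pairs = np.char.add(np.char.add(alleles[:, 0::2], " "), alleles[:, 1::2])

rs_names = np.array(header[n_meta:])
wanted = ["rs1", "rs3"]           # e.g. the second column of B
keep = np.isin(rs_names, wanted)  # boolean mask over the rs columns

selected = pairs[:, keep]
print(rs_names[keep])
print(selected)
```

The result is a plain 2-D string array rather than a structured array, so columns are picked by mask instead of by field name.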
