Combine two columns under one header in Numpy array

Views: 468 Published: 2016/6/3 10:55:38 Tags: python arrays numpy

Problem description

I have two NumPy arrays that I need to combine, keeping only certain columns of A, which has size (888, 1114253), depending on the rows I have in B, which has size (555861, 3).

The problem is that the header of A has 55730 entries: each column has two values!

In other words, I want only the columns of A whose header matches a row in B, but in A each column is "double".

An example will clarify:

A:

family id mum dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
     1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
     2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
     3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
     4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T

Since each rsxxx column header in this file has two corresponding columns, I have to find a way to put them together so that I can read the file as an array.

B:

1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360

The desired output is

Output:

 family id mum dad  rs1 rs2 rs5 rs8 rs12
  1      1   4   6  A T A A G G A A C C
  2      2   7   9  T A G A G A A C C T
  3      3   2   8  T T G G G G A C C C
  4      4   5   1  A A A A G A A A C T

Ideas?

On the console

B:

array([['1', 'rs3094315', '752566'],
       ['1', 'rs12562034', '768448'],
       ['1', 'rs3934834', '1005806'],
       ..., 
       ['23', 'rs2032612', '21866491'],
       ['23', 'rs2032621', '21872738'],
       ['23', 'rs2032617', '21896261']], 
      dtype='<S10')

Solution

It looks like the columns are separated by two spaces, while the two values of each gene pair are separated by one space. If so, you can use

delimiter='  '   #two spaces

in np.genfromtxt:

import numpy as np
from StringIO import StringIO # for example file

a = StringIO("""family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T """)


nrs = 12        # number of `rs` columns, for dtype
dt = 'int,'*4 + 'S10,'*nrs

A = np.genfromtxt(a, delimiter='  ', names=True, dtype=dt)

A:

array([ (1, 1, 4, 6, ' A T', 'A A', 'T T', 'C C', 'G G', 'A T', 'A G', 'A A', 'G A', 'T A', 'G G', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'C T', 'C T', 'G A', 'T T', 'A A', 'A C', 'G G', 'T A', 'C C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'C T', 'C T', 'G G', 'A T', 'A G', 'A C', 'G G', 'T T', 'C C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'T T', 'C C', 'G A', 'T T', 'A A', 'A A', 'G A', 'T A', 'G C', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs3', 'S10'), ('rs4', 'S10'), ('rs5', 'S10'), ('rs6', 'S10'), ('rs7', 'S10'), ('rs8', 'S10'), ('rs9', 'S10'), ('rs10', 'S10'), ('rs11', 'S10'), ('rs12', 'S10')])
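
The listing above is Python 2 code (the `StringIO` module no longer exists in Python 3), and the first `rs` value keeps a leading space because some fields in the sample are separated by three spaces. Below is a minimal sketch of the same read for Python 3. The `U3` string width, `autostrip=True` and `encoding="utf-8"` are my own choices and are not part of the original answer.

```python
import io
import numpy as np

# Sample file from the question: fields are separated by two (sometimes three)
# spaces, while the two values of a gene pair are separated by a single space.
sample = """family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T
"""

nrs = 12  # number of `rs` columns, for the dtype
# Four integer columns followed by `nrs` short unicode columns ("A T" has 3 characters).
dt = ",".join(["i8"] * 4 + ["U3"] * nrs)

A = np.genfromtxt(
    io.StringIO(sample),
    delimiter="  ",    # two spaces, as in the answer
    names=True,        # take the field names from the header line
    dtype=dt,
    autostrip=True,    # assumption: strip the stray space left by three-space gaps
    encoding="utf-8",  # assumption: keep strings as str instead of bytes
)

print(A.dtype.names)
print(A["rs1"])
```

With `autostrip=True` each pair is stored as `'A T'` rather than `' A T'`, so later comparisons do not have to account for the extra space.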

Then to access only the columns from B, do something like this:

b = StringIO("""1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360""")

B = np.genfromtxt(b, usecols=[1], dtype='S10')

Now, use A[B]:

A[B]
array([(' A T', 'A A', 'G G', 'A A', 'C C'),
       (' T A', 'G A', 'G A', 'A C', 'C T'),
       (' T T', 'G G', 'G G', 'A C', 'C C'),
       (' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])
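
On Python 3 with a recent NumPy, field names are `str`, and selecting several fields expects a list of names rather than an array of byte strings. The sketch below shows one way to do the selection under those assumptions. The small array `A` is rebuilt by hand so the example stands alone, and `keep`, `sub` and `packed` are illustrative names. Filtering out names that are missing from `A` and calling `repack_fields` are my additions, not part of the original answer.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Rebuild the small structured array from the example so this block runs on its own.
dt = [("family", "i8"), ("id", "i8"), ("mum", "i8"), ("dad", "i8")]
dt += [("rs%d" % i, "U3") for i in range(1, 13)]
rows = [
    (1, 1, 4, 6, "A T", "A A", "T T", "C C", "G G", "A T", "A G", "A A", "G A", "T A", "G G", "C C"),
    (2, 2, 7, 9, "T A", "G A", "C T", "C T", "G A", "T T", "A A", "A C", "G G", "T A", "C C", "C T"),
    (3, 3, 2, 8, "T T", "G G", "C T", "C T", "G G", "A T", "A G", "A C", "G G", "T T", "C C", "C C"),
    (4, 4, 5, 1, "A A", "A A", "T T", "C C", "G A", "T T", "A A", "A A", "G A", "T A", "G C", "C T"),
]
A = np.array(rows, dtype=dt)

b = io.StringIO("""1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360""")

# Read only the name column of B as str and turn it into a plain Python list.
wanted = np.genfromtxt(b, usecols=[1], dtype="U10", encoding="utf-8").ravel().tolist()

# Keep only names that really are fields of A (B may list markers A does not have).
fields = set(A.dtype.names)
keep = [name for name in wanted if name in fields]

sub = A[keep]                    # multi-field selection; a view in recent NumPy
packed = rfn.repack_fields(sub)  # optional: compact copy without the unused fields

print(packed)
```

Because a multi-field selection in recent NumPy is a view that still spans the full record, `repack_fields` is worth considering when `A` is as wide as in the question and only a subset of the columns is needed.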

Or, if you want the first four columns too:

A[['family', 'id', 'mum', 'dad'] + list(B)]
array([(1, 1, 4, 6, ' A T', 'A A', 'G G', 'A A', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'G A', 'A C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'G G', 'A C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])
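
The two-space delimiter depends on the exact spacing of the file. If that assumption does not hold, a possible alternative is to use the fact stated in the question that every `rs` header owns two consecutive value columns: compute the column positions from the header, read only those columns with whitespace splitting, and join each pair afterwards. This is only a sketch under that assumption. `n_meta`, `col`, `idx` and `pairs` are illustrative names, and the list `wanted` stands in for the second column of B.

```python
import numpy as np

sample = """family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T
"""
wanted = ["rs1", "rs2", "rs5", "rs8", "rs12"]  # stands in for column 1 of B

lines = sample.splitlines()
header = lines[0].split()
n_meta = 4  # family, id, mum, dad

# Each rs header owns two whitespace-separated value columns in the body.
col = {name: n_meta + 2 * i for i, name in enumerate(header[n_meta:])}

keep = [name for name in wanted if name in col]
idx = list(range(n_meta))
for name in keep:
    idx += [col[name], col[name] + 1]

# Read only the needed columns, as strings, splitting on any whitespace.
data = np.loadtxt(lines[1:], dtype=str, usecols=idx)

meta = data[:, :n_meta].astype(int)
alleles = data[:, n_meta:]
# Join the two halves of every pair: columns 0, 2, 4, ... with columns 1, 3, 5, ...
pairs = np.char.add(np.char.add(alleles[:, 0::2], " "), alleles[:, 1::2])

print(keep)
print(pairs)
```

Reading with `usecols` means that only the selected columns are kept in memory, which may matter for an array as wide as the one described in the question.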
