Four-Gamete Test in R


Question

I have (or will have) data that looks like the following:

Individual Nuk Name       Position Individual.1 Nuk.1 Name.1     Position.1
Ind 1      A   Locus_1988 23       Ind 1        A     Locus_3333 15
Ind 2      A   Locus_1988 23       Ind 2        G     Locus_3333 15
Ind 3      G   Locus_1988 23       Ind 3        A     Locus_3333 15
Ind 4      G   Locus_1988 23       Ind 4        -     Locus_3333 15
Ind 5      A   Locus_1988 23       Ind 5        G     Locus_3333 15
Ind 6      G   Locus_1988 23       Ind 6        G     Locus_3333 15
Ind 1      C   Locus_1988 23       Ind 1        C     Locus_3333 18
Ind 2      T   Locus_1988 23       Ind 2        C     Locus_3333 18
Ind 3      T   Locus_1988 23       Ind 3        T     Locus_3333 18
Ind 4      C   Locus_1988 23       Ind 4        -     Locus_3333 18
Ind 5      -   Locus_1988 23       Ind 5        C     Locus_3333 18
Ind 6      T   Locus_1988 23       Ind 6        T     Locus_3333 18
Ind 1      T   Locus_2301 12       Ind 1        T     Locus_4123 38
Ind 2      T   Locus_2301 12       Ind 2        T     Locus_4123 38
Ind 3      A   Locus_2301 12       Ind 3        -     Locus_4123 38
Ind 4      -   Locus_2301 12       Ind 4        A     Locus_4123 38
Ind 5      A   Locus_2301 12       Ind 5        A     Locus_4123 38
Ind 6      T   Locus_2301 12       Ind 6        T     Locus_4123 38
Ind 1      G   Locus_2301 31       Ind 1        G     Locus_4123 52
Ind 2      C   Locus_2301 31       Ind 2        C     Locus_4123 52
Ind 3      C   Locus_2301 31       Ind 3        G     Locus_4123 52
Ind 4      G   Locus_2301 31       Ind 4        C     Locus_4123 52
Ind 5      -   Locus_2301 31       Ind 5        C     Locus_4123 52
Ind 6      G   Locus_2301 31       Ind 6        -     Locus_4123 52

The data is organized as pairs of loci (for example, Locus_1988 and Locus_3333 above form a pair). For each combination of positions within a pair, I need to run a Four-Gamete Test (FGT) on the Nuk columns, i.e. check whether, for any two letters drawn from the four possible letters G, C, A and T, all four two-letter combinations are present.

For the data above, the pair Locus_1988 Position 23 + Locus_3333 Position 15 has the combinations AA AG GA G- AG GG. Because AA, AG, GA and GG are all present, this pair passes the FGT, and this needs to be recorded (i.e. with a 1 in a new_column).

The next group, Locus_1988 Position 23 + Locus_3333 Position 18, has the combinations CC TC TT C- -C TT. Because the combination CT is missing, this group does not pass the FGT (recorded as 0 in new_column).
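
To make the criterion concrete, the check for a single pair of positions could be sketched roughly as below. The helper name `passes_fgt` and its `missing` argument are hypothetical, and the sketch assumes each position carries exactly two alleles and that individuals with a "-" at either position are ignored.

```r
# Hypothetical helper: TRUE if all four gametes are observed for two biallelic sites.
# a, b: alleles per individual at the two positions; "-" marks missing data.
passes_fgt <- function(a, b, missing = "-") {
  keep <- a != missing & b != missing   # drop individuals with a gap at either site
  a <- a[keep]
  b <- b[keep]
  # Two alleles per site allow at most four distinct two-letter haplotypes
  length(unique(a)) == 2 &&
    length(unique(b)) == 2 &&
    length(unique(paste0(a, b))) == 4
}

# Locus_1988 position 23 vs Locus_3333 position 15
passes_fgt(c("A", "A", "G", "G", "A", "G"), c("A", "G", "A", "-", "G", "G"))  # TRUE
# Locus_1988 position 23 vs Locus_3333 position 18
passes_fgt(c("C", "T", "T", "C", "-", "T"), c("C", "C", "T", "-", "C", "T"))  # FALSE
```

Requiring exactly two alleles per site means a monomorphic position can never pass, which is one possible reading of the test rather than the only one.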

How would you proceed to do this test?

There are many loci, each with many (30) individuals, and some but not all loci have several positions to be tested.

I am thinking that it should be possible to build the test along these lines:

if (grepl(AG & GA & AA & GG | AC & CA & AA & CC | AT & TA & AA & TT | CT & TC & CC & TT | CG & GC & CC & GG | GT & TG & GG & TT, data = combination of the two columns)) print(1) else print(0)

But apparently I am not allowed to use the & and | operators this way. I'm also having a lot of trouble figuring out how to run the test with reference to the Name first and the Position second. Would you give each group a unique name in a new column (as below) and then run the test on each group?

Individual Nuk Name       Pos      Individual.1 Nuk.1 Name.1     Pos.1 Grp
Ind 1      A   Locus_1988 23       Ind 1        A     Locus_3333 15    1
Ind 2      A   Locus_1988 23       Ind 2        G     Locus_3333 15    1
Ind 3      G   Locus_1988 23       Ind 3        A     Locus_3333 15    1
Ind 4      G   Locus_1988 23       Ind 4        -     Locus_3333 15    1
Ind 5      A   Locus_1988 23       Ind 5        G     Locus_3333 15    1
Ind 6      G   Locus_1988 23       Ind 6        G     Locus_3333 15    1
Ind 1      C   Locus_1988 23       Ind 1        C     Locus_3333 18    2
Ind 2      T   Locus_1988 23       Ind 2        C     Locus_3333 18    2
Ind 3      T   Locus_1988 23       Ind 3        T     Locus_3333 18    2
Ind 4      C   Locus_1988 23       Ind 4        -     Locus_3333 18    2
Ind 5      -   Locus_1988 23       Ind 5        C     Locus_3333 18    2
Ind 6      T   Locus_1988 23       Ind 6        T     Locus_3333 18    2
Ind 1      T   Locus_2301 12       Ind 1        T     Locus_4123 38    3
Ind 2      T   Locus_2301 12       Ind 2        T     Locus_4123 38    3
Ind 3      A   Locus_2301 12       Ind 3        -     Locus_4123 38    3
Ind 4      -   Locus_2301 12       Ind 4        A     Locus_4123 38    3
Ind 5      A   Locus_2301 12       Ind 5        A     Locus_4123 38    3
Ind 6      T   Locus_2301 12       Ind 6        T     Locus_4123 38    3
Ind 1      G   Locus_2301 31       Ind 1        G     Locus_4123 52    4
Ind 2      C   Locus_2301 31       Ind 2        C     Locus_4123 52    4
Ind 3      C   Locus_2301 31       Ind 3        G     Locus_4123 52    4
Ind 4      G   Locus_2301 31       Ind 4        C     Locus_4123 52    4
Ind 5      -   Locus_2301 31       Ind 5        C     Locus_4123 52    4
Ind 6      G   Locus_2301 31       Ind 6        -     Locus_4123 52    4
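
A group column like the Grp shown above could be derived from the four key columns instead of being typed by hand. The sketch below is only an illustration: it uses a four-row stand-in for the data, takes the column names from the first table (Position, Position.1), and numbers groups in order of first appearance.

```r
# Small stand-in for the data above: two rows from each of the first two groups
df1 <- data.frame(
  Name       = rep("Locus_1988", 4),
  Position   = c(23, 23, 23, 23),
  Name.1     = rep("Locus_3333", 4),
  Position.1 = c(15, 15, 18, 18)
)

# interaction() builds one factor level per combination of the key columns;
# drop = TRUE discards combinations that never occur in the data
key <- interaction(df1$Name, df1$Position, df1$Name.1, df1$Position.1, drop = TRUE)

# Number the groups in order of first appearance
df1$Grp <- match(key, unique(key))
df1$Grp
# 1 1 2 2
```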

I think this could be done in a loop, but with this much data I'm afraid it would take a long time to run.

Recommended answer

Split the data (df1) by positions and locus names:

split1 <- split(df1, list(df1$Name, df1$Position, df1$Name.1, df1$Position.1), drop = TRUE)

Create the test:

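do.call(rbind,
  lapply(split1, function(x) {
    # Letters observed at either position in this group, ignoring gaps
    all_letters <- union( x$Nuk, x$Nuk.1 )
    all_letters <- all_letters[all_letters != "-"]
    # Every ordered two-letter combination of those letters
    letter_comb <- expand.grid(all_letters, all_letters, stringsAsFactors = FALSE)
    data.frame(
      # TRUE only if each combination is carried by at least one individual
      FGT = all(
        sapply( seq_len(nrow(letter_comb)), function(i) {
          any(x$Nuk == letter_comb[i,1] & x$Nuk.1 == letter_comb[i,2])
        })
      ),
      # Keep the grouping keys so each result row identifies its group
      Name = x$Name[1], Position = x$Position[1],
      Name.1 = x$Name.1[1], Position.1 = x$Position.1[1]
    )
  })
)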

Result:

#                               FGT       Name Position     Name.1 Position.1
# Locus_1988.23.Locus_3333.15  TRUE Locus_1988       23 Locus_3333         15
# Locus_1988.23.Locus_3333.18 FALSE Locus_1988       23 Locus_3333         18
# Locus_2301.12.Locus_4123.38 FALSE Locus_2301       12 Locus_4123         38
# Locus_2301.31.Locus_4123.52  TRUE Locus_2301       31 Locus_4123         52
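
Since the question asks for a 1/0 flag in a new_column and worries about loop speed, one possible variation is to count distinct haplotypes per group with tapply() and map the result back onto the rows. This is a sketch under stated assumptions rather than the answer's method: it rebuilds the example data compactly, assumes every position has at most two alleles (so four distinct haplotypes means all four gametes are present), and skips individuals with a "-" at either position.

```r
# Example data from the question, rebuilt compactly (one string per group)
nuk  <- c("AAGGAG", "CTTC-T", "TTA-AT", "GCCG-G")   # Nuk
nuk1 <- c("AGA-GG", "CCT-CT", "TT-AAT", "GCGCC-")   # Nuk.1
df1 <- data.frame(
  Nuk        = unlist(strsplit(nuk, "")),
  Nuk.1      = unlist(strsplit(nuk1, "")),
  Name       = rep(c("Locus_1988", "Locus_2301"), each = 12),
  Position   = rep(c(23, 23, 12, 31), each = 6),
  Name.1     = rep(c("Locus_3333", "Locus_4123"), each = 12),
  Position.1 = rep(c(15, 18, 38, 52), each = 6)
)

# One factor level per (Name, Position, Name.1, Position.1) group
grp <- interaction(df1$Name, df1$Position, df1$Name.1, df1$Position.1, drop = TRUE)

# Count distinct two-letter haplotypes per group, using complete rows only
hap      <- paste0(df1$Nuk, df1$Nuk.1)
complete <- df1$Nuk != "-" & df1$Nuk.1 != "-"
n_hap    <- tapply(hap[complete], grp[complete], function(h) length(unique(h)))

# Map the per-group result back onto every row as 1 (passed) or 0 (not passed)
df1$new_column <- as.integer(n_hap[as.character(grp)] == 4)
unique(df1[, c("Name", "Position", "Name.1", "Position.1", "new_column")])
```

On the example data this flags the first and fourth groups with 1 and the other two with 0, in line with the TRUE/FALSE result above. A group with no complete rows ends up as NA.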
