Calculate the number of occurrences in a column per category


Question

I'm trying to calculate the number of occurrences of "Opp" in the iets column per SNP name (eventually I want to divide the number of occurrences of "Opp" by df$MM).

library(data.table)
df <- structure(list(SNP = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("rs80932150", "rs000001"), class = "factor"), FID = c(116601888L, 116621563L, 117253533L, 118635095L, 118943247L), IID = c(116601888L, 116621563L, 117253533L, 118635095L, 118943247L), NEW = structure(c(16L, 14L, 16L, 14L, 14L), .Label = c("A/A", "A/C", "A/G", "A/T", "C/A", "C/C", "C/G", "C/T", "G/A", "G/C", "G/G", "G/T", "T/A", "T/C", "T/G", "T/T"), class = "factor"), OLD = structure(c(6L, 6L, 6L, 6L, 6L), .Label = c("A/A", "A/C", "A/G", "A/T", "C/A", "C/C", "C/G", "C/T", "G/A", "G/C", "G/G", "G/T", "T/A", "T/C", "T/G", "T/T"), class = "factor"), count = c(1L, 1L, 1L, 1L, 1L), MM = c(4L, 4L, 4L, 1L, 4L), iets = c("Opp", "Het", "Opp", "Het", "Het")), .Names = c("SNP", "FID", "IID", "NEW", "OLD", "count", "MM", "iets"), class = "data.frame", row.names = c(NA, -5L))
setDT(df)

#         SNP       FID       IID NEW OLD count MM iets
#1 rs80932150 116601888 116601888 T/T C/C     1  4  Opp
#2 rs80932150 116621563 116621563 T/C C/C     1  4  Het
#3 rs80932150 117253533 117253533 T/T C/C     1  4  Opp
#4   rs000001 118635095 118635095 T/C C/C     1  1  Het
#5 rs80932150 118943247 118943247 T/C C/C     1  4  Het

My expected result is as follows:

df
#          SNP       FID       IID NEW OLD count MM iets oppcount percentage
#1: rs80932150 116601888 116601888 T/T C/C     1  4  Opp      2        0.5
#2: rs80932150 116621563 116621563 T/C C/C     1  4  Het      2        0.5
#3: rs80932150 117253533 117253533 T/T C/C     1  4  Opp      2        0.5
#4:   rs000001 118635095 118635095 T/C C/C     1  1  Het      0        0.0
#5: rs80932150 118943247 118943247 T/C C/C     1  4  Het      2        0.5

I've been trying things similar to the code below, but I can't figure out how to assign the occurrence values to my oppcount and percentage columns.
First I would have to count the number of "Opp" per SNP, and then divide it by MM.

as.character((sum(df$iets == "Opp")/(df[,.N, by = df$SNP][[2]])))
#[1] "0.5" "2"  

How can I calculate the number of occurrences of "Opp" per SNP (category)?

Recommended answer

You can update your data.table by reference with the := operator. With:

df[, `:=` (oppcount = sum(iets=='Opp'), percentage = sum(iets=='Opp')/.N), by = SNP]


you get:

> df
          SNP       FID       IID NEW OLD count MM iets oppcount percentage
1: rs80932150 116601888 116601888 T/T C/C     1  4  Opp        2        0.5
2: rs80932150 116621563 116621563 T/C C/C     1  4  Het        2        0.5
3: rs80932150 117253533 117253533 T/T C/C     1  4  Opp        2        0.5
4:   rs000001 118635095 118635095 T/C C/C     1  1  Het        0        0.0
5: rs80932150 118943247 118943247 T/C C/C     1  4  Het        2        0.5
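
The call above divides by .N (the number of rows per SNP), while the question mentions dividing by MM. In the sample data the two coincide (MM is 4 for the SNP with four rows and 1 for the SNP with one row). If MM can differ from the group size, a minimal sketch of dividing by MM instead could look like this; the table `dt` is a cut-down illustration of the question's data, not the full data set:

```r
library(data.table)

# Illustrative subset of the question's columns
dt <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  MM   = c(4L, 4L, 4L, 1L, 4L),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# Step 1: count "Opp" within each SNP and store it on every row of that SNP
dt[, oppcount := sum(iets == "Opp"), by = SNP]

# Step 2: divide by each row's own MM value (no grouping needed here)
dt[, percentage := oppcount / MM]
```

Splitting the work into two calls keeps the grouped count separate from the row-wise division, which makes it easier to swap the denominator later.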

Or, based on the suggestion by @Frank in the comments, you could also use one of the following two options:

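# method 1: compute the count once per SNP and assign both columns by reference
df[, c('oppcount', 'percentage') := {s = sum(iets=='Opp'); .(s, s/.N)}, by = SNP]
# method 2: build a per-SNP summary and join it back on SNP (returns a new data.table)
df[df[, {s = sum(iets=='Opp'); .(oppcount = s, percentage = s/.N)}, by = SNP], on = 'SNP']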
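
To see what the join in method 2 works with, it can help to look at the per-SNP summary on its own. The following is a minimal sketch on toy data; the names `dt`, `summ` and `joined` are illustrative and do not appear in the original answer:

```r
library(data.table)

# Illustrative subset of the question's columns
dt <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# One row per SNP: the count of "Opp" and its share of the group's rows
summ <- dt[, {s = sum(iets == "Opp"); .(oppcount = s, percentage = s / .N)}, by = SNP]

# Joining the summary back attaches the group values to every original row;
# the result is a new data.table and dt itself is left unchanged
joined <- dt[summ, on = "SNP"]
```

Keeping the summary as a separate object is handy when the per-category counts are also needed on their own, for example for a report with one line per SNP.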

A base R alternative:

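# With a character first argument, ave() coerces the per-group results back to
# character, so pass a numeric 0/1 indicator instead of the iets column itself
transform(df,
          oppcount = ave(as.numeric(iets == 'Opp'), SNP, FUN = sum),
          percentage = ave(as.numeric(iets == 'Opp'), SNP, FUN = mean))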
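
One thing to watch for with ave() is the type of its first argument: the per-group results are written back into a copy of that vector, so a character input yields a character result. A small sketch of the difference, using made-up vectors rather than the question's data:

```r
# Illustrative vectors (not from the question's data)
iets <- c("Opp", "Het", "Opp", "Het", "Het")
SNP  <- c("a", "a", "a", "b", "a")

# Character input: the per-group sums are coerced back to character
as_char <- ave(iets, SNP, FUN = function(x) sum(x == "Opp"))

# Numeric 0/1 indicator: the per-group sums stay numeric
as_num <- ave(as.numeric(iets == "Opp"), SNP, FUN = sum)

class(as_char)  # "character"
class(as_num)   # "numeric"
```

If the counts are going to be used in arithmetic afterwards, the numeric form avoids a later as.numeric() conversion.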

A dplyr alternative:

library(dplyr)
df %>% 
  group_by(SNP) %>% 
  mutate(oppcount = sum(iets=='Opp'),
         percentage = oppcount/n())
