Only include outliers from each column in a dataframe
Question
I have a dataframe as follows:
chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3
I have to create a loop that:
a) Calculates the upper and lower limit (UL and LL) for each column from the third column onwards.
b) Only includes rows that fall outside of the UL and LL (Zoutliers).
c) Then count the number of rows where the Zoutlier is the same direction (i.e. positive or negative) as the previous or the subsequent row for the same chr.
So the output would be:
ZScore1 TBGGT 12_try 324Gtt AMN2
nrow 4 6 4 4
So far I have the following code:
library(data.table)#v1.9.5
f1 <- function(df, ZCol){
#A) Determine the UL and LL and then generate the Zoutliers
UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
Zoutliers <- which(ZCol > UL | ZCol < LL)
#B) Exclude Zoutliers per chr if same direction as previous or subsequent row
na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol)))
.SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
by=chr])[, list(.N)]}
nm1 <- paste0(names(df)
setnames(do.call(cbind,lapply(nm1, function(x) f1(df, x))), nm1)[]
The code is patched together from various places. The problem I have is combining parts A) and B) of the code to get the output I want.
Recommended answer
Can you try this function? I was not sure what alpha is, so I could not reproduce the expected output; instead I included it as a parameter of the function.
# read your data via copy & paste
d <- read.table("clipboard", header = TRUE)
# or, as suggested in Frank's comment, via fread
d <- data.table::fread("chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3")
# set up the function
foo <- function(x, alpha, chr){
# your code for task a) and b)
UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
Zoutliers <- which(x > UL | x < LL)
# part c)
# factor which specifies the direction. 0 values are set as positives
pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
# count the occurrence per chromosome and direction.
aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
}
# apply over the columns and get a list of dataframes with number of outliers per chr and direction.
apply(d[, 3:ncol(d)], 2, foo, alpha = 0.95, chr = d$chr)
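The function above counts outliers per chr and direction, which is not quite the count described in step c) of the question. Below is a minimal base R sketch of one possible reading of that step. It assumes that rows are already sorted by chr and leftPos, that "previous or subsequent row" means the neighbouring outlier row on the same chr after filtering, and that alpha is a tuning value you supply (0.95 here is only a placeholder). The helper name count_same_direction is hypothetical and does not come from the question or the answer.

```r
# Rebuild the example data so the sketch is self-contained.
d <- data.frame(
  chr      = c(1, 1, 2, 3, 3, 3, 3),
  leftPos  = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
  TBGGT    = c(34, 2, -15, -100, -100, 5, -8),
  `12_try` = c(43, 1, 7, -4, -1, -5, -9),
  `324Gtt` = c(19, -1, -9, 4, 6, 9, 1),
  AMN2     = c(43, -9, -18, -9, -1, 5, 3),
  check.names = FALSE
)

# Hypothetical helper: count outliers whose sign matches the previous or
# the next outlier on the same chr. alpha is an assumed tuning value.
count_same_direction <- function(x, chr, alpha) {
  # a) limits around the median, as in the question
  centre     <- median(x, na.rm = TRUE)
  half_width <- alpha * IQR(x, na.rm = TRUE)
  # b) keep only values outside [LL, UL]
  is_out  <- !is.na(x) & (x > centre + half_width | x < centre - half_width)
  s       <- sign(x[is_out])
  chr_out <- as.character(chr[is_out])
  n <- length(s)
  if (n == 0) return(0L)
  # c) compare each outlier with its neighbours (NA at either end)
  prev_s   <- c(NA, s[-n]);  prev_chr <- c(NA, chr_out[-n])
  next_s   <- c(s[-1], NA);  next_chr <- c(chr_out[-1], NA)
  same_prev <- !is.na(prev_s) & prev_chr == chr_out & prev_s == s
  same_next <- !is.na(next_s) & next_chr == chr_out & next_s == s
  sum(same_prev | same_next)
}

# One count per data column, from the third column onwards.
sapply(d[, 3:ncol(d)], count_same_direction, chr = d$chr, alpha = 0.95)
```

If neighbours should instead be taken from all rows of the chr rather than only the outlier rows, the sign comparison would have to run before the filtering step. The counts also depend entirely on the alpha you choose, so they will not match the expected output in the question unless the same alpha is used.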