Only include outliers from each column in a dataframe

Views: 66 | Published: 2017/3/12 11:52:50 | Tags: r, data.table

Problem description

I have a dataframe as follows:

 chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3

I have to create a loop that:

a) Calculates the upper and lower limits (UL and LL) for each column from the third column onwards.
b) Only includes rows that fall outside of the UL and LL (Zoutliers).
c) Then counts the number of rows where the Zoutlier is in the same direction (i.e. positive or negative) as the previous or the subsequent row for the same chr.
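Steps a) and b) can be sketched for a single column as follows. This is a minimal illustration, not the asker's code: the helper name `outlier_limits` is hypothetical, and `alpha = 0.95` is an assumed value, since the question never says what `alpha` is.

```r
# Column TBGGT from the table above.
x <- c(34, 2, -15, -100, -100, 5, -8)
alpha <- 0.95  # assumed multiplier; the question does not define alpha

# Hypothetical helper: limits are median +/- alpha * IQR, as in the question's code.
outlier_limits <- function(x, alpha) {
  centre <- median(x, na.rm = TRUE)
  spread <- alpha * IQR(x, na.rm = TRUE)
  c(LL = centre - spread, UL = centre + spread)
}

lims <- outlier_limits(x, alpha)

# Step b): flag the values that fall outside the limits.
is_outlier <- x < lims[["LL"]] | x > lims[["UL"]]
which(is_outlier)  # row positions of the outliers
```

Under this assumed `alpha`, the limits for TBGGT are -65.95 and 49.95, so only the two -100 values (rows 4 and 5) count as outliers.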

So the output would be:

 ZScore1   TBGGT   12_try   324Gtt   AMN2
 nrow        4        6        4       4

So far I have code as follows:

    library(data.table) # v1.9.5

    f1 <- function(df, ZCol){

      # A) Determine the UL and LL and then generate the Zoutliers
      UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
      LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
      Zoutliers <- which(ZCol > UL | ZCol < LL)

      # B) Exclude Zoutliers per chr if same direction as previous or subsequent row
      na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol)))
        .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
        by=chr])[, list(.N)]
    }

    nm1 <- paste0(names(df)
    setnames(do.call(cbind, lapply(nm1, function(x) f1(df, x))), nm1)[]

The code is patched together from various places. The problem I have is combining parts A) and B) of the code to get the output I want.

Recommended answer

Can you try this function? I was not sure what alpha is, so I could not reproduce the expected output, and I included it as a variable in the function.

    # read your data via copy & paste
    d <- read.table("clipboard", header = TRUE)

    # or, as Frank mentioned in a comment, via fread
    d <- data.table::fread("chr   leftPos   TBGGT   12_try   324Gtt   AMN2
    1     24352     34      43       19       43
    1     53534     2       1        -1       -9
    2     34        -15     7        -9       -18
    3     3443      -100    -4       4        -9
    3     3445      -100    -1       6        -1
    3     3667      5       -5       9        5
    3     7882      -8      -9       1        3")

    # set up the function
    foo <- function(x, alpha, chr){
      # your code for tasks a) and b)
      UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
      LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
      Zoutliers <- which(x > UL | x < LL)
      # part c)
      # factor which specifies the direction; 0 values are treated as positive
      pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
      # count the occurrences per chromosome and direction
      aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
    }

    # apply over the columns to get a list of data frames with the number of
    # outliers per chr and direction
    apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)
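The function above counts outliers per chromosome and direction. For comparison, here is a base R sketch of one possible reading of step c): keep only the outlier rows, then count those whose sign matches the previous or next retained outlier on the same chr. The function name `count_same_direction` is hypothetical, `alpha = 0.95` is again an assumption, and whether "previous or subsequent row" means neighbours among the retained outliers or among all rows is not settled by the question.

```r
# Data from the question; check.names = FALSE keeps names such as "12_try".
d <- data.frame(
  chr      = c(1, 1, 2, 3, 3, 3, 3),
  leftPos  = c(24352, 53534, 34, 3443, 3445, 3667, 7882),
  TBGGT    = c(34, 2, -15, -100, -100, 5, -8),
  `12_try` = c(43, 1, 7, -4, -1, -5, -9),
  `324Gtt` = c(19, -1, -9, 4, 6, 9, 1),
  AMN2     = c(43, -9, -18, -9, -1, 5, 3),
  check.names = FALSE
)

# Hypothetical helper combining steps a), b) and one reading of c).
count_same_direction <- function(x, chr, alpha) {
  centre <- median(x, na.rm = TRUE)
  spread <- alpha * IQR(x, na.rm = TRUE)
  keep <- which(x > centre + spread | x < centre - spread)
  if (length(keep) == 0L) return(0L)

  s <- sign(x[keep])  # direction of each retained outlier
  g <- chr[keep]      # chromosome of each retained outlier

  # Within each chr, compare every outlier with the previous and next one.
  same <- unlist(lapply(split(s, g), function(v) {
    n <- length(v)
    prev_same <- c(FALSE, v[-1] == v[-n])
    next_same <- c(v[-n] == v[-1], FALSE)
    prev_same | next_same
  }))
  sum(same)
}

counts <- sapply(d[, 3:ncol(d)], count_same_direction,
                 chr = d$chr, alpha = 0.95)
counts
```

With this assumed `alpha`, TBGGT gives 2 (the two -100 rows on chr 3) rather than the 4 shown in the question, so the expected output presumably relies on a different `alpha` or a different definition of neighbouring rows.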
