R distribution plot with NA data and thresholds

Problem description

I have a large data file in the form:

Input_SNP   Set_1    Set_2     Set_3     Set_4     Set_5     Set_6
1.09        0.162    NA        2.312     1.876     0.12      0.812
0.687       NA       0.987     1.32      1.11      1.04      NA
NA          1.890    0.923     1.43      0.900     2.02      2.7
2.801       0.642    0.791     0.812     NA        0.31      1.60
1.33        1.33     NA        1.22      0.23      0.18      1.77
2.91        1.00     1.651     NA        1.55      3.20      0.99
2.00        2.31     0.89      1.13      1.25      0.12      1.55

I would like to make a distribution of the totals in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a "random" amount of missing data (NA), which messes up the distribution. It seems my only option is to do a distribution of percentages instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6. I would like to turn these percentages into a bell curve or a binned histogram. Despite my example, the percentage of values over 2 in each column should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to plot my Input_SNP percentage on that distribution to get a p-value. Do you guys know how to do this in R? Currently the data is available both as a data.frame and as a .csv file.

I had been trying: hist(colSums(as.matrix(df) > 2)) but that had not been working (I think because of the NAs). So how can I incorporate that?

My desired output is a histogram of percentages of each column that is over 2. The bins in the histogram can be 0.05.

Recommended answer

Perhaps you could try this, assuming your data is in a data.frame called df:

result <- unlist(lapply(sapply(df, function(x) which(x>2)), function(x) length(x)))
result
#Input_SNP     Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#    2         1         0         1         0         2         1 

In reality this is a 3-step process. First, result <- sapply(df, function(x) which(x>2)) will give you the following structure:

#List of 7
#$ Input_SNP: int [1:2] 4 6
#$ Set_1    : int 7
#$ Set_2    : int(0) 
#$ Set_3    : int 1
#$ Set_4    : int(0) 
#$ Set_5    : int [1:2] 3 6
#$ Set_6    : int 3

This is then passed to an lapply() call of the following form:

lapply(result, function(x) length(x))

This yields the following structure:

#List of 7
#$ Input_SNP: int 2
#$ Set_1    : int 1
#$ Set_2    : int 0
#$ Set_3    : int 1
#$ Set_4    : int 0
#$ Set_5    : int 2
#$ Set_6    : int 1

Finally, unlist() flattens this list into the named vector of counts shown at the top.
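As a side note, the same counts can probably be obtained in one step. The sketch below is not part of the original answer; it rebuilds the example data from the question and relies on colSums() with na.rm = TRUE, which skips the NA entries that made the asker's hist(colSums(...)) attempt misbehave. The name counts is only illustrative.

```r
# Rebuild the example data from the question.
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1     = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2     = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3     = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4     = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5     = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6     = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# df > 2 is a logical matrix in which NA cells stay NA;
# na.rm = TRUE drops them, so each column sum is the count of values above 2.
counts <- colSums(df > 2, na.rm = TRUE)
counts
# Input_SNP = 2, Set_1 = 1, Set_2 = 0, Set_3 = 1, Set_4 = 0, Set_5 = 2, Set_6 = 1
```

This gives the same numbers as the sapply()/lapply()/unlist() chain, just without the intermediate list of row indices.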

If Input_SNP should not be part of the desired result, remove it from the df inside the sapply(), like so:

result <- unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
result
#Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 
#1     0     1     0     2     1 

Finally, for the proportions (here result must be the six-element vector computed without Input_SNP):

result/colSums(!is.na(df[,-1]))
#    Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#0.1666667 0.0000000 0.1666667 0.0000000 0.2857143 0.1666667 
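The answer stops at the proportions, while the question also asks for a histogram and a p-value. The following is a minimal sketch of one way to get there, not the original author's method. It assumes that an empirical one-sided p-value (the share of sets whose proportion is at least as large as that of Input_SNP) is an acceptable reading of "p-value"; the names props, set_props, input_prop and p_emp are made up for illustration. The breaks cover the whole 0 to 1 range in steps of 0.05 and would need adjusting for real data whose proportions sit in a much narrower range.

```r
# Rebuild the example data from the question.
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1     = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2     = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3     = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4     = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5     = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6     = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# Proportion of non-missing values above 2 in each column:
# the mean of a logical vector is the fraction of TRUEs, and na.rm drops the NAs.
props <- colMeans(df > 2, na.rm = TRUE)

# Separate the observed column from the comparison sets.
input_prop <- props["Input_SNP"]
set_props  <- props[names(props) != "Input_SNP"]

# Histogram of the set proportions with bins of width 0.05,
# and a dashed line marking where Input_SNP falls.
hist(set_props, breaks = seq(0, 1, by = 0.05),
     main = "Proportion of values > 2 per set", xlab = "Proportion > 2")
abline(v = input_prop, col = "red", lty = 2)

# Assumed definition: empirical one-sided p-value, i.e. the fraction of sets
# whose proportion is greater than or equal to the Input_SNP proportion.
p_emp <- mean(set_props >= input_prop)
p_emp
```

With only six sets the empirical p-value is very coarse; on a large file with many Set_ columns it becomes more informative.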
