R distribution plot with NA data and thresholds
Question
I have a large data file in the form:
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
I would like to make a distribution of the number of values in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a "random" amount of missing data (NA), which distorts the distribution of raw counts. It seems my only option is to make a distribution of percentages instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6. I would like to turn these percentages into a bell curve or a binned histogram. Despite my example, in the real data the percentage of values over 2 in each column should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to plot my Input_SNP percentage on that distribution to get a p-value. Do you guys know how to do this in R? Currently the data is available both as a data.frame and as a .csv file.
I had been trying:
hist(colSums(as.matrix(df) > 2))
but that was not working (I think because of the NAs). So how can I account for them?
My desired output is a histogram of the percentage of values over 2 in each column. The bins in the histogram can be 0.05 wide.
Recommended answer
Perhaps you could try this, assuming your data is in a data.frame called df:
result <- unlist(lapply(sapply(df, function(x) which(x>2)), function(x) length(x)))
result
#Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
# 2 1 0 1 0 2 1
In reality this is a 3-step process. First, result <- sapply(df, function(x) which(x>2)) gives you the following structure (the row indices of the values over 2 in each column; NAs are dropped by which()):
#List of 7
#$ Input_SNP: int [1:2] 4 6
#$ Set_1 : int 7
#$ Set_2 : int(0)
#$ Set_3 : int 1
#$ Set_4 : int(0)
#$ Set_5 : int [1:2] 3 6
#$ Set_6 : int 3
This is then passed to an lapply() of the following form:
lapply(result, function(x) length(x))
which gives the following structure:
#List of 7
#$ Input_SNP: int 2
#$ Set_1 : int 1
#$ Set_2 : int 0
#$ Set_3 : int 1
#$ Set_4 : int 0
#$ Set_5 : int 2
#$ Set_6 : int 1
Finally, this list is unlisted to get the final form.
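The three steps above can probably be collapsed into a single call, because a comparison against NA yields NA and colSums() can skip those cells with na.rm = TRUE. This is only a sketch of an alternative, not part of the original answer; it rebuilds the example data with read.table() so that it runs on its own, and the name counts is just an illustrative choice:

```r
# Rebuild the example data from the question ("NA" is read as a missing value)
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
")

# df > 2 is a logical matrix with NA wherever the data is missing;
# na.rm = TRUE drops those cells before the TRUEs are summed per column
counts <- colSums(df > 2, na.rm = TRUE)
counts
```

This should give the same per-column counts as the unlist(lapply(sapply(...))) version, and it also explains why the hist(colSums(as.matrix(df) > 2)) attempt in the question failed: without na.rm = TRUE, any column containing an NA sums to NA.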
If Input_SNP should not be part of the desired result, remove it from df inside the sapply(), like so:
unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
#Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
#1 0 1 0 2 1
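Finally, for the proportions, store the counts for the Set columns in result and divide by the number of non-NA values in each column:
result <- unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
result/colSums(!is.na(df[,-1]))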
# Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
#0.1666667 0.0000000 0.1666667 0.0000000 0.2857143 0.1666667
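For the histogram and the p-value asked about in the question, something along these lines might work. It is a rough sketch under several assumptions: the bin width of 0.05 is taken literally on the proportion scale (for real data between 0% and 3% a much smaller width, such as 0.0005, is probably what is wanted), the p-value is a simple one-sided empirical one (the share of Set columns whose proportion is at least as large as that of Input_SNP), and names such as set_props, input_prop and bin_width are illustrative only:

```r
# Rebuild the example data from the question ("NA" is read as a missing value)
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
")

# Mean of a logical column with NAs removed = (count over 2) / (non-NA count)
props <- colMeans(df > 2, na.rm = TRUE)
set_props  <- props[-1]               # background distribution (Set_* columns)
input_prop <- props[["Input_SNP"]]    # observed value to compare against it

# Assumed bin width; adjust to the scale of the real data
bin_width <- 0.05
breaks <- seq(0, max(c(set_props, input_prop)) + bin_width, by = bin_width)

# Histogram of the Set proportions, with the Input_SNP proportion marked
hist(set_props, breaks = breaks,
     xlab = "Proportion of values > 2", main = "Set columns")
abline(v = input_prop, col = "red", lty = 2)

# One-sided empirical p-value: fraction of Set columns at least as extreme
p_value <- mean(set_props >= input_prop)
p_value
```

With only six Set columns the empirical p-value is very coarse; it becomes more meaningful with the large number of columns described in the question.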