R: splitting dataset into quartiles/deciles. What is the right method?
Problem description
I am very new with R, so hoping I can get some pointers on how to achieve the desired manipulation of my data.
I have an array of data with three variables.
gene_id fpkm meth_val
1 100629094 0.000 0.0063
2 100628995 0.000 0.0000
3 102655614 111.406 0.0021
I'd like to plot the average meth_val after stratifying my gene_ids based on fpkm into quartiles or deciles.
Once I load my data into a dataframe...
data <- read.delim("myfile.tsv", sep='\t')
I can determine the fpkm deciles using:
quantile(data$fpkm, probs = seq(0, 1, length.out = 11), type = 5)
which yields
0% 10% 20% 30% 40% 50%
0.000000e+00 9.783032e-01 7.566164e+00 3.667630e+01 1.379986e+02 3.076280e+02
60% 70% 80% 90% 100%
5.470552e+02 8.875592e+02 1.486200e+03 2.974264e+03 1.958740e+05
From there, I'd like to split the dataframe into 10 groups according to which decile each gene's fpkm value falls into. Then I'd like to plot the meth_val of each decile in ggplot as a box plot and perform a statistical test across deciles.
The main thing I'm really stuck on is how to split my dataset in the proper way. Any assistance would be hugely appreciated!
Thanks a bunch!
Another way would be ntile() in dplyr.
library(tidyverse)
foo <- data.frame(a = 1:100,
b = runif(100, 50, 200),
stringsAsFactors = FALSE)
foo %>%
mutate(quantile = ntile(b, 10))
# a b quantile
#1 1 93.94754 2
#2 2 172.51323 8
#3 3 99.79261 3
#4 4 81.55288 2
#5 5 116.59942 5
#6 6 128.75947 6
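If you would rather reuse the quantile() breakpoints from the question, a rough sketch with base R's cut() might look like the following. The data frame dat is simulated here purely for illustration (the column names follow the question, the values are made up), and the Kruskal-Wallis test is only one possible choice of test across deciles, not something the question specifies.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the real table; values are invented for illustration
dat <- data.frame(
  gene_id  = 1:1000,
  fpkm     = rexp(1000, rate = 1 / 500),
  meth_val = runif(1000, 0, 0.01)
)

# Decile boundaries, computed the same way as in the question
breaks <- quantile(dat$fpkm, probs = seq(0, 1, length.out = 11), type = 5)

# Assign each row to a decile; include.lowest = TRUE keeps the minimum fpkm
# in the first group instead of turning it into NA
dat$decile <- cut(dat$fpkm, breaks = breaks, include.lowest = TRUE, labels = 1:10)

# Mean meth_val per decile
decile_means <- aggregate(meth_val ~ decile, data = dat, FUN = mean)
print(decile_means)

# Rank-based test for a difference in meth_val across the ten groups
kw <- kruskal.test(meth_val ~ decile, data = dat)
print(kw)

# Box plot of meth_val per decile, drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = decile, y = meth_val)) +
    geom_boxplot()
  print(p)
}
```

One thing to watch for with real expression data: if many genes share the same fpkm (for example a large block of zeros), two or more quantile breaks can coincide and cut() will stop with an error about non-unique breaks. ntile() avoids that error because it always makes equal-sized buckets by rank, but it does so by placing tied values into different buckets, so which approach is "right" depends on how you want ties handled.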