Generate distribution given percentile ranks


Question

I'd like to generate a distribution in R given the following scores and percentile ranks.

x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

For example, PercRank = 1 means that 1% of the data has a value/score <= 1 (the first value of x). Similarly, PercRank = 7 means that 7% of the data has a value/score <= 2, and so on.

I don't know how to find the underlying distribution. I'd be glad to get some guidance on how to obtain the pdf of the underlying distribution from just this much information.

Recommended answer

From Wikipedia:

The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it.
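As a small illustration of that definition, here is a minimal sketch. The helper name `perc_rank` and the `scores` vector are made up for this example and are not part of the original question or answer.

```r
# Hypothetical helper: percentile rank of score s within a vector of scores,
# i.e. the percentage of scores that are less than or equal to s.
perc_rank <- function(scores, s) {
  100 * mean(scores <= s)
}

scores <- c(1, 2, 2, 3, 4, 4, 4, 5, 7, 9)  # made-up data for illustration

pr_4    <- perc_rank(scores, 4)      # 7 of the 10 scores are <= 4
pr_ecdf <- 100 * ecdf(scores)(4)     # same quantity via the empirical CDF
```

Both expressions compute the same thing: the empirical CDF evaluated at the score, expressed as a percentage.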

To illustrate this, let's create a distribution, say a normal distribution with mean=2 and sd=2, so that we can test our code later.

# 1000 samples from normal(2,2)
x1 <- rnorm(1000, mean=2, sd=2)

Now, let's take the same percentile ranks you mentioned in your post and divide them by 100 so that they represent cumulative probabilities.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100

And what are the values (scores) corresponding to these percentiles?

# generating values similar to your x.
x <- c(t(quantile(x1, cum.p)))
> x
 [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791
 [6]  0.2781699  0.5893503  0.8396868  1.4222489  2.1519328

This means that 1% of the data is less than or equal to -2.18, 7% of the data is less than or equal to -1.47, and so on. Now we have x and cum.p (equivalent to your PercRank). Let's forget x1 and the fact that this should be a normal distribution. To find out what distribution it could be, let's get the actual probabilities from the cumulative probabilities by using diff, which takes the difference between the nth and (n-1)th elements. The first entry, cum.p[1], is the probability mass below x[1], and the final .01 is the mass above x[10] (1 - 0.99).

prob <- c( cum.p[1], diff(cum.p), .01)
> prob
# [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01
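The trailing `.01` above is hard-coded for this particular set of percentile ranks. A minimal sketch of an equivalent formulation, which pads the cumulative probabilities with 0 and 1 so the tail masses fall out of `diff` directly, might look like this:

```r
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

# Mass in each interval: below the first score, between consecutive scores,
# and above the last score (whatever is left to reach 1).
prob <- diff(c(0, cum.p, 1))

# Sanity checks: one more interval than scores, and the masses sum to 1.
stopifnot(length(prob) == length(cum.p) + 1)
stopifnot(isTRUE(all.equal(sum(prob), 1)))
```

This gives the same eleven values as the `prob` vector printed above, without having to work out the last entry by hand.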

Now, all we have to do is generate a sequence of, say, 100 evenly spaced values (it could be any number) for each interval of x (x[1]:x[2], x[2]:x[3], ...), including one interval beyond each end of x, and then finally sample from this pooled data as many points as you need (say, 10000), with the probabilities computed above.

This can be done with:

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 5) 
fin  <- abs(max(x)) + 5

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = TRUE)
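A different way to get a similar result, not the method used in this answer, is inverse-transform sampling: interpolate the quantile function linearly through the known (probability, score) pairs and push uniform random numbers through it. The sketch below uses the x = 1:10 scores from the question. The tail limits `lo` and `hi`, the seed, and the names `q_fun` and `out2` are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed only to make the sketch reproducible

x     <- 1:10
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

# Assumed tail limits: one unit beyond the smallest and largest score.
lo <- min(x) - 1
hi <- max(x) + 1

# Piecewise-linear quantile function through (0, lo), (cum.p, x), (1, hi).
q_fun <- approxfun(c(0, cum.p, 1), c(lo, x, hi))

# Inverse-transform sampling: map uniform draws through the quantile function.
out2 <- q_fun(runif(10000))

# The sample quantiles should land close to the original scores 1:10.
round(quantile(out2, cum.p), 2)
```

Like the grid-based approach, this treats the data as uniformly spread within each interval, but it draws continuous values instead of picking from a fixed grid of 100 points per interval.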

Now we have 10000 samples from the distribution. Let's look at it. It should resemble a normal distribution with mean = 2 and sd = 2.

> hist(out)

> c(mean(out), sd(out))
# [1] 1.954834 2.170683

From the histogram, it looks like a normal distribution, with mean = 1.95 and sd = 2.17 (~ 2).

Note: Some of what I've explained may be roundabout, and the code may or may not work with other distributions. The point of this post was just to explain the concept with a simple example.

In an attempt to clarify @Dwin's point, I ran the same code with x = 1:10, corresponding to the OP's question, changing only the value of x (and extending the tails by 1 instead of 5).

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- 1:10

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1) 
fin  <- abs(max(x)) + 1

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = TRUE)

> quantile(out, cum.p) # ~ => x = 1:10
# 1%     7%    12%    23%    41%    62%    73%    80%    92%    99% 
# 0.878  1.989  2.989  4.020  5.010  6.030  7.030  8.020  9.050 10.010 

> hist(out)
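Since the question asks for the pdf, it may help to note that spreading each interval's probability evenly over the interval amounts to a piecewise-constant density: mass divided by interval width. The sketch below is one way to write that down for x = 1:10. It assumes tail limits one unit beyond the smallest and largest score (which differs slightly from the `init` value used above), and the names `dens` and `pdf_fun` are made up for this example.

```r
x     <- 1:10
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

ival <- c(min(x) - 1, x, max(x) + 1)  # assumed tail limits, one unit each side
prob <- diff(c(0, cum.p, 1))          # probability mass per interval

# Piecewise-constant density: mass divided by interval width.
dens <- prob / diff(ival)

# A density must integrate to 1 over its support.
stopifnot(isTRUE(all.equal(sum(dens * diff(ival)), 1)))

# Step function that is 0 outside [min(ival), max(ival)) and dens inside.
pdf_fun <- stepfun(ival, c(0, dens, 0))

pdf_fun(5.5)  # density between scores 5 and 6
```

With only ten percentile ranks this is a coarse estimate. Any density whose CDF passes through the same ten points is equally consistent with the given information.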
