How can Python use n, min, max, mean, std, 25%, 50%, 75%, Skew, Kurtosis to define a pseudo-random Probability Density Estimate/Function?


Question

In reading and experimenting with numpy.random, I can't seem to find or create what I need: a 10-parameter Python pseudo-random value generator including count, min, max, mean, sd, 25th %ile, 50th %ile (median), 75th %ile, skew, and kurtosis.

From https://docs.python.org/3/library/random.html I see these distributions: uniform, normal (Gaussian), lognormal, negative exponential, gamma, and beta. However, I need to generate values directly for a distribution defined only by my 10 parameters, with no reference to a distribution family.

Is there documentation, or an author(s), of a numpy.random.xxxxxx(n, min, max, mean, sd, 25%, 50%, 75%, skew, kurtosis), or what is the closest existing source code that I might modify to achieve this goal?

This would, in a way, be the reverse of describe() including skew and kurtosis. I could loop or optimize until randomly generated numbers meet a criterion, though that could take an unbounded amount of time to satisfy all 10 of my parameters.
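The brute-force idea above can be sketched as a rejection loop. This is a hypothetical illustration matching only two of the ten parameters; the targets, tolerance, and function name are illustrative, and the combinatorial cost of matching all ten this way is exactly what makes the approach impractical:

```python
import numpy as np

def sample_until_match(n, lo, hi, target_mean, target_sd, tol, max_tries=100_000):
    # Keep drawing candidate samples until the mean and sd are
    # within the tolerance of the targets, or give up.
    rng = np.random.default_rng(1)
    for _ in range(max_tries):
        x = rng.uniform(lo, hi, size=n)
        if abs(x.mean() - target_mean) < tol and abs(x.std() - target_sd) < tol:
            return x
    return None  # no match found within max_tries

x = sample_until_match(n=15, lo=1, hi=45, target_mean=23, target_sd=12.7, tol=1.0)
```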

I have found optim in R, which generates a data set, but have so far been unable to extend the parameters in the R optim source code or duplicate it with Python scipy.optimize or similar; these still depend on optimization methods instead of directly pseudo-randomly creating a data set according to my 10 parameters as I need:

m0 <- 20
sd0 <- 5
min <- 1
max <- 45
n <- 15
set.seed(1)
mm <- min:max
x0 <- sample(mm, size=n, replace=TRUE)
objfun <- function(x) {(mean(x)-m0)^2+(sd(x)-sd0)^2}
candfun <- function(x) {x[sample(n, size=1)] <- sample(mm, size=1)
    return(x)}
objfun(x0) ##INITIAL RESULT:83.93495
o1 <- optim(par=x0, fn=objfun, gr=candfun, method="SANN", control=list(maxit=1e6))
mean(o1$par) ##INITIAL RESULT:20
sd(o1$par) ##INITIAL RESULT:5
plot(table(o1$par))
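A rough Python analogue of the R snippet above might look like this. It is a sketch only: scipy.optimize.minimize with bounds (L-BFGS-B) stands in for optim's "SANN" method, so it searches continuous values rather than swapping integer samples:

```python
import numpy as np
from scipy.optimize import minimize

# Same targets as the R snippet: mean 20, sd 5, values in [1, 45], n = 15
m0, sd0, lo, hi, n = 20, 5, 1, 45, 15

def objfun(x):
    # Squared distance from the target mean and sd (ddof=1 matches R's sd())
    return (x.mean() - m0) ** 2 + (x.std(ddof=1) - sd0) ** 2

rng = np.random.default_rng(1)
x0 = rng.uniform(lo, hi, size=n)            # random starting data set
res = minimize(objfun, x0, bounds=[(lo, hi)] * n)
```

Because the problem is underdetermined (many 15-point data sets have mean 20 and sd 5), the optimizer typically drives the objective close to zero from almost any starting point.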

Answer

The most general way to generate a random number following a distribution is as follows:

  • Generate a uniform random number bounded by 0 and 1 (e.g., numpy.random.random()).
  • Take the inverse CDF (inverse cumulative distribution function) of that number.

The result is a number that follows the distribution.
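As a minimal sketch of this inverse-transform method, here it is applied to the exponential distribution, chosen only because its inverse CDF has a simple closed form, ICDF(u) = -ln(1 - u) / lam:

```python
import numpy as np

def sample_exponential(n, lam, rng=None):
    # Step 1: uniform random numbers in [0, 1)
    rng = rng or np.random.default_rng(0)
    u = rng.random(n)
    # Step 2: apply the inverse CDF of Exp(lam)
    return -np.log(1.0 - u) / lam

samples = sample_exponential(100_000, lam=2.0)  # mean should be near 1/lam = 0.5
```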

In your case, the inverse CDF (ICDF(x)) is already determined by five of your parameters -- the minimum, maximum, and three percentiles -- as follows:

  • ICDF(0) = minimum
  • ICDF(0.25) = 25th percentile
  • ICDF(0.5) = 50th percentile (median)
  • ICDF(0.75) = 75th percentile
  • ICDF(1) = maximum

Thus, you already have some idea of what the inverse CDF looks like. All you have to do now is somehow optimize the inverse CDF for the other parameters (mean, standard deviation, skewness, and kurtosis). For example, you can "fill in" the inverse CDF at the other percentiles and see how well they match the parameters you're after. In this sense, a good starting guess is a linear interpolation of the percentiles just mentioned. Another thing to keep in mind is that the inverse CDF can never decrease.
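The linear-interpolation starting guess can be sketched as follows (the quantile values are the ones from the example at the end of this answer):

```python
import numpy as np

# Known quantile levels and the corresponding inverse-CDF values:
# min, 25th, 50th, 75th percentiles, and max
quantile_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
quantile_values = [39, 116, 147, 186, 401]

# Initial guess: the inverse CDF evaluated at 101 evenly spaced points,
# obtained by linear interpolation through the five known quantiles
grid = np.linspace(0, 1, 101)
icdf_guess = np.interp(grid, quantile_levels, quantile_values)

# An inverse CDF must be non-decreasing ("can never go down")
assert np.all(np.diff(icdf_guess) >= 0)
```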

The following code shows a solution. It performs the following steps:

  • It calculates an initial guess for the inverse CDF via a linear interpolation. The initial guess consists of the values of that function at 101 evenly spaced points, including the five percentiles mentioned above.
  • It sets up the bounds of the optimization. The optimization is bounded by the minimum and maximum values everywhere except at the five percentiles.
  • It sets up the other four parameters.
  • It then passes the objective function (_lossfunc), the initial guess, the bounds and the other parameters to SciPy's scipy.optimize.minimize method for optimization.
  • Once the optimization finishes, the code checks for success and raises an error if unsuccessful.
  • If the optimization succeeds, the code calculates an inverse CDF for the final result.
  • It generates N uniform random values.
  • It transforms those values with the inverse CDF and returns those values.
import scipy.stats.mstats as mst
from scipy.optimize import minimize
from scipy.interpolate import interp1d
import numpy

# Define the loss function, which compares the calculated
# and ideal parameters
def _lossfunc(x, *args):
    mean, stdev, skew, kurt, chunks = args
    st = (
        (numpy.mean(x) - mean) ** 2
        + (numpy.sqrt(numpy.var(x)) - stdev) ** 2
        + ((mst.skew(x) - skew)) ** 2
        + ((mst.kurtosis(x) - kurt)) ** 2
    )
    return st

def adjust(rx, percentiles):
    eps = (max(rx) - min(rx)) / (3.0 * len(rx))
    # Make result monotonic
    for i in range(1, len(rx)):
        if (
            i - 2 >= 0
            and rx[i - 2] < rx[i - 1]
            and rx[i - 1] >= rx[i]
            and rx[i - 2] < rx[i]
        ):
            rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
        elif rx[i - 1] >= rx[i]:
            rx[i] = rx[i - 1] + eps
    # Constrain to percentiles
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        prev = rx[previ]
        curr = rx[percentiles[pi][0]]
        prevideal = percentiles[pi - 1][1]
        currideal = percentiles[pi][1]
        realrange = max(eps, curr - prev)
        idealrange = max(eps, currideal - prevideal)
        for i in range(previ + 1, percentiles[pi][0]):
            if rx[i] >= currideal or rx[i] <= prevideal:
                rx[i] = (
                    prevideal
                    + max(eps * (i - previ + 1 + 1), rx[i] - prev)
                    * idealrange
                    / realrange
                )
        rx[percentiles[pi][0]] = currideal
    # Make monotonic again
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        curri = percentiles[pi][0]
        for i in range(previ + 1, curri + 1):
            if (
                i - 2 >= 0
                and rx[i - 2] < rx[i - 1]
                and rx[i - 1] >= rx[i]
                and rx[i - 2] < rx[i]
                and i - 1 != previ
                and i - 1 != curri
            ):
                rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
            elif rx[i - 1] >= rx[i] and i != curri:
                rx[i] = rx[i - 1] + eps
    return rx

# Calculates an inverse CDF for the given nine parameters.
def _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=100):
    if chunks < 0:
        raise ValueError
    # Minimum of 16 chunks
    chunks = max(16, chunks)
    # Round chunks up to closest multiple of 4
    if chunks % 4 != 0:
        chunks += 4 - (chunks % 4)
    # Calculate initial guess for the inverse CDF; an
    # interpolation of the inverse CDF through the known
    # percentiles
    interp = interp1d([0, 0.25, 0.5, 0.75, 1.0], [mn, p25, p50, p75, mx], kind="cubic")
    rnge = mx - mn
    x = interp(numpy.linspace(0, 1, chunks + 1))
    # Bounds, taking percentiles into account
    bounds = [(mn, mx) for i in range(chunks + 1)]
    percentiles = [
        [0, mn],
        [int(chunks * 1 / 4), p25],
        [int(chunks * 2 / 4), p50],
        [int(chunks * 3 / 4), p75],
        [int(chunks), mx],
    ]
    for p in percentiles:
        bounds[p[0]] = (p[1], p[1])
    # Other parameters
    otherParams = (mean, stdev, skew, kurt, chunks)
    # Optimize the result for the given parameters
    # using the initial guess and the bounds
    result = minimize(
        _lossfunc,  # Loss function
        x,  # Initial guess
        otherParams,  # Arguments
        bounds=bounds,
    )
    rx = result.x
    if result.success:
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
    # Calculate interpolating function of result
    ls = numpy.linspace(0, 1, chunks + 1)
    success = result.success
    icdf = interp1d(ls, rx, kind="linear")
    # == To check the quality of the result
    if False:
        meandiff = numpy.mean(rx) - mean
        stdevdiff = numpy.sqrt(numpy.var(rx)) - stdev
        print(meandiff)
        print(stdevdiff)
        print(mst.skew(rx) - skew)
        print(mst.kurtosis(rx) - kurt)
        print(icdf(0) - percentiles[0][1])
        print(icdf(0.25) - percentiles[1][1])
        print(icdf(0.5) - percentiles[2][1])
        print(icdf(0.75) - percentiles[3][1])
        print(icdf(1) - percentiles[4][1])
    return (icdf, success)

def random_10params(n, mn, p25, p50, p75, mx, mean, stdev, skew, kurt):
    """Note: Kurtosis as used here is Fisher's kurtosis,
    or kurtosis excess. Stdev is the square root of numpy.var()."""
    # Calculate inverse CDF
    icdf, success = (None, False)
    tries = 0
    # Try up to 10 times to get a converging inverse CDF,
    # increasing the mesh each time
    chunks = 500
    while tries < 10:
        icdf, success = _get_inverse_cdf(
            mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=chunks
        )
        tries += 1
        chunks += 100
        if success:
            break
    if not success:
        print("Warning: Estimation failed and may be inaccurate")
    # Generate uniform random variables
    npr = numpy.random.random(size=n)
    # Transform them with the inverse CDF
    return icdf(npr)

Example:

print(random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383))
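To sanity-check output like this, scipy.stats.describe reports mean, variance, skewness, and Fisher kurtosis in one call, and numpy.percentile covers the quantiles. This sketch uses a plain normal sample standing in for the generator's output:

```python
import numpy as np
from scipy import stats

# Stand-in sample with the example's target mean and stdev;
# a normal distribution has skewness and Fisher kurtosis near 0
rng = np.random.default_rng(0)
sample = rng.normal(loc=154.1207, scale=52.3257, size=100_000)

d = stats.describe(sample)                     # n, min/max, mean, variance, skew, kurtosis
q25, q50, q75 = np.percentile(sample, [25, 50, 75])
```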


One last note: if you have access to the underlying data points, rather than just their statistics, there are other methods you can use to sample from the distribution those data points form. Examples include kernel density estimation, histograms, or regression models (particularly for time series data). See also "Generate random data based on existing data".

