How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?

Views: 53 Published: 2021/7/2 19:51:37 Tags: python random statistics probability-density cdf


Problem description

I'm trying to generate random samples from a lognormal distribution in Python; the application is simulating network traffic. I'd like to generate samples such that:

The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)

My strategy is to use the inverse CDF (also known as the Smirnov transform, I believe):

Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.

The problem is, when I calculate the 10th and 90th percentiles at the end, I get completely wrong numbers.

Here is my code:

%matplotlib inline

import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm

# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)

# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')

x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)

# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)

# Generate random uniform data
input = np.random.uniform(size=10000)

# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]

# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]

# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)

Which produces the output:

(223.99999999999997, 2480.0000000000009)

... and not the (100, 1000) that I would like to see. Any advice appreciated!

Solution

First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal distribution is defined in terms of the base-e logarithm (natural log), which means 320 ≈ 10^2.5 ≈ e^5.77.
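To make the base-10 versus base-e point concrete, here is a minimal sketch. It assumes the question's values (mu = 2.505, sigma = 1/2.505 on the log10 scale); the names mu10, sigma10, m and s are illustrative only. If log10(X) is normal with parameters (mu10, sigma10), then ln(X) is normal with both parameters multiplied by ln(10):

```python
import numpy as np

# Values taken from the question, expressed on the log10 scale.
mu10 = 2.505
sigma10 = 1 / 2.505

# Convert to natural-log parameters: ln(X) = ln(10) * log10(X).
m = mu10 * np.log(10)
s = sigma10 * np.log(10)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 200_000

# Two equivalent ways to draw from the same distribution.
a = 10 ** rng.normal(mu10, sigma10, size=n)
b = rng.lognormal(mean=m, sigma=s, size=n)

# Mode of a log-normal with natural-log parameters (m, s).
mode = np.exp(m - s**2)

print("m, s on the natural-log scale:", m, s)
print("10/50/90th percentiles (10**normal):", np.percentile(a, [10, 50, 90]))
print("10/50/90th percentiles (lognormal): ", np.percentile(b, [10, 50, 90]))
print("mode:", mode)
```

Note that 10^mu10 is the median of this distribution, not its mode. The mode, exp(m - s*s), falls well below 320, so centring the normal at 2.5 does not by itself give a modal value of 320.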

Second, I would approach the problem in a different way. You need m and s to sample from a log-normal distribution.

If you look at the Wikipedia article on the log-normal distribution, you can see that it is a two-parameter distribution. And you have exactly two conditions:

Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8

where CDF is expressed via the error function (a common function available in virtually any numerical library)
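For reference, the standard log-normal CDF written with the error function, using the same m and s as above, is shown below. The second line is simply the 80% condition after substituting m = ln 320 + s^2 from the mode condition, which leaves a single equation in s:

```latex
\mathrm{CDF}(x; m, s) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\ln x - m}{s\sqrt{2}}\right)\right], \qquad x > 0

\mathrm{CDF}(1000;\ \ln 320 + s^2,\ s) - \mathrm{CDF}(100;\ \ln 320 + s^2,\ s) = 0.8
```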

So there are two non-linear equations for two parameters. Solve them to find m and s, then plug them into any standard log-normal sampler.
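One possible way to carry this out is sketched below. It is not the only approach: it assumes SciPy is available, eliminates m with the mode condition, and solves the remaining one-dimensional equation for s with scipy.optimize.brentq. The bracket [0.1, 2.0], the seed, the sample size and the helper name mass_error are assumptions made for illustration. Sampling is done through the inverse CDF (lognorm.ppf) applied to uniform random numbers, as in the question's strategy.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

MODE = 320.0               # target mode
LOW, HIGH = 100.0, 1000.0  # target interval
TARGET_MASS = 0.8          # probability mass wanted inside [LOW, HIGH]


def mass_error(s):
    """Mass inside [LOW, HIGH] minus the target, with m fixed by the mode."""
    m = np.log(MODE) + s**2              # from Mode = exp(m - s*s)
    dist = lognorm(s, scale=np.exp(m))   # SciPy: shape = s, scale = exp(m)
    return dist.cdf(HIGH) - dist.cdf(LOW) - TARGET_MASS


# Assumed bracket: the error is positive at s = 0.1 and negative at s = 2.0.
s = brentq(mass_error, 0.1, 2.0)
m = np.log(MODE) + s**2

# Inverse-CDF sampling: push uniform numbers through the quantile function.
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
traffic = lognorm.ppf(u, s, scale=np.exp(m))

inside = np.mean((traffic >= LOW) & (traffic <= HIGH))
print("m =", m, "s =", s)
print("mode =", np.exp(m - s**2))
print("fraction of samples in [100, 1000]:", inside)
print("10th/90th percentiles:", np.percentile(traffic, [10, 90]))
```

Keep in mind that the two conditions fix the mode and the mass inside [100, 1000]; they do not force the 10th and 90th percentiles to be exactly 100 and 1000. With the mode pinned at 320, most of the remaining 20% ends up above 1000.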
