再presenting连续概率分布 [英] Representing continuous probability distributions

查看：157 发布时间：2015/11/30 15:52:24 algorithm math haskell statistics probability

本文介绍了再presenting连续概率分布的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

予有涉及的连续概率分布函数，其中大部分是凭经验确定（例如出发时间，运输时间）的集合中的问题。我需要的是采取两个这样的PDF文件，并进行算术运算的一些方法。例如。如果我有两个值x取自PDF X，和PDF Y Y取，我需要得到的PDF为（X + Y），或任何其他操作F（X，Y）。

I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).

这是解析解是不可能的，所以我正在寻找的是PDF文件的一些重新presentation，允许这样的事情。一个明显的（但昂贵的计算）解决方案是蒙特卡罗：产生大量的x和y的值，然后仅仅测量F（X，Y）。但是，这需要太多的CPU时间。

An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.

我确实想重新presenting一份PDF文件范围的列表，其中每个系列有一个大致相同的概率，有效地重新presenting的PDF作为均匀分布的列表的联合。但我看不出如何将它们结合起来。

I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.

有没有人有什么好的办法解决这一问题呢？

Does anyone have any good solutions to this problem?

编辑：的目标是创建一个小型的语言（又名领域特定语言），用于操纵PDF文件。但首先我需要理清底层重新presentation和算法。

The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.

编辑2： dmckee显示直方图的实现。这就是我和我的制服分布表得到的。但我不明白如何将它们结合起来，以创造新的分配。最终我需要找到的东西如P，（X＆LT; Y）的情况下，这可能相当小

Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.

修改3：我有一堆直方图。他们不是均匀分布的，因为我是从次数数据生成它们，所以基本上如果我有100个样本，我想十个点的直方图，然后我分配10个样本每栏，使酒吧可变宽度但不断的地区。

Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurance data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.

我已经想通了，加你卷积他们的PDF文件，我已经去骨了对数学的这一点。当你卷板两台同类发行版，你得到一个新的分布具有三个部分：更广泛的均匀分布仍然存在，但有一个三角形贴在两侧的窄之一的宽度。所以，如果我卷板X和Y的每一个元素，我会得到这些一堆，所有的重叠。现在，我试图找出如何总结所有这些，然后让一个直方图是它的最佳逼近。

I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.

我开始怀疑蒙特卡罗是不是一个坏主意。

I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.

修改4： 本文讨论均匀分布的卷积在一些细节。一般来说，你得到一个梯形的分布。由于在直方图每个列是均匀分布的，我希望这一问题可以通过卷积这些列和总结的结果来解决。

Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.

然而，结果是显着大于输入更复杂，同时还包括三角形。 修改5： [错误的东西去掉。但是，如果这些梯形是近似于矩形与同一地区，那么你找到正确答案，并减少在结果矩形的数量看起来pretty的直白了。这可能是解决方案，我一直在试图找到。

However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.

修改6：解决了！这里是最后的哈斯克尔code这个问题：

Edit 6: Solved! Here is the final Haskell code for this problem:

-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height.  A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
      cN :: Int,
      -- ^ Number of bars in the histogram.
      cAreas :: [Double],
      -- ^ Areas of the bars.  @length cAreas == cN@
      cBars :: [Double]
      -- ^ Boundaries of the bars.  @length cBars == cN + 1@
    } deriving (Show, Read)


{- | Add distributions.  If two random variables @vX@ and @vY@ are
taken from distributions @x@ and @y@ respectively then the
distribution of @(vX + vY)@ will be @(x .+. y).

This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars").  Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.

When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q - q2-q1.  Then we get:


>   |                              |
>   |     ______                   |
>   |     |    |           with    |  _____________
>   |     |    |                   |  |           |
>   +-----+----+-------            +--+-----------+-
>         p1   p2                     q1          q2
> 
>  gives    h|....... _______________
>            |       /:             :\
>            |      / :             : \                1
>            |     /  :             :  \     where h = -
>            |    /   :             :   \              q
>            |   /    :             :    \
>            +--+-----+-------------+-----+-----
>             p1+q1  p2+q1       p1+q2   p2+q2

However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions.  So instead we
store a uniform approximation to the trapezoid with the same area:

>           h|......___________________
>            |     | /               \ |
>            |     |/                 \|
>            |     |                   |
>            |    /|                   |\
>            |   / |                   | \
>            +-----+-------------------+--------
>               p1+q1+p/2          p2+q2-p/2

-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN     = length bars - 1,
             cBars  = map fst bars, 
             cAreas = zipWith barArea bars (tail bars)}
    where
      -- The convolve function returns a list of two (x, deltaY) pairs.
      -- These can be sorted by x and then sequentially summed to get
      -- the new histogram.  The "b" parameter is the product of the
      -- height of the input bars, which was omitted from the diagrams
      -- above.
      convolve b c1 c2 d1 d2 =
          if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2 else convolve1 b d1 
d2 c1 c2
      convolve1 b p1 p2 q1 q2 = 
          [(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
               where 
                 halfP = (p2-p1)/2
                 h = b / (q2-q1)
      outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst) 
$ concat
                [convolve (areaC*areaD) c1 c2 d1 d2 |
                 (c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
                 (d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
                ]
      sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)

      bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
      barArea (x1, h) (x2, _) = (x2 - x1) * h

其他的运营商都将作为练习留给读者。

Other operators are left as an exercise for the reader.

推荐答案

无需直方图或符号计算：一切都可以做到在封闭的形式语言层面，如果采取了正确的观点

No need for histograms or symbolic computation: everything can be done at the language level in closed form, if the right point of view is taken.

[我将使用测量和分配一词互换。另外，我的Haskell是生锈的，我请你原谅我的IM precise在这个领域。] 的

概率分布是真正的 CODATA 。

让亩，是一个概率测度。可以将用唯一要做的是对集成测试功能（这是措施的一种可能的数学定义）。请注意，这是你最终会做的事：比如对身份整合，走的意思是：

Let mu be a probability measure. The only thing you can do with a measure is integrate it against a test function (this is one possible mathematical definition of "measure"). Note that this is what you will eventually do: for instance integrating against identity is taking the mean:

mean :: Measure -> Double
mean mu = mu id

另外一个例子：

another example:

variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2

另外一个例子，它计算P（亩＆LT; X）：

another example, which computes P(mu < x):

cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0

这表明，通过双重性的方法。

This suggests an approach by duality.

类型测量所以人表示键入（双 - ＆GT;双人间） - ＆GT;双。这使您可以模拟蒙特卡洛模拟，对PDF格式的数字/符号积分等。例如结果，功能

The type Measure shall therefore denote the type (Double -> Double) -> Double. This allows you to model results of MC simulation, numerical/symbolic quadrature against a PDF, etc. For instance, the function

empirical :: [Double] -> Measure
empirical h:t f = (f h) + empirical t f

返回˚F针对的的经验措施的通过，例如获得的积分。 MC采样。此外

returns the integral of f against an empirical measure obtained by eg. MC sampling. Also

from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f

从构造（常规）密度的措施。

construct measures from (regular) densities.

现在，好消息。如果亩，NU两种措施中，卷积的亩** NU 由下式给出：

Now, the good news. If mu and nu are two measures, the convolution mu ** nu is given by:

(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)

因此，考虑到两项措施，您可以将反对他们的卷积。

So, given two measures, you can integrate against their convolution.

此外，由于法律的随机变量X 亩，A * X的规律由下式给出：

Also, given a random variable X of law mu, the law of a * X is given by:

rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)

另外，披（X）的分布由图象量度给出的phi_ * X，在我们的框架：

Also, the distribution of phi(X) is given by the image measure phi_* X, in our framework:

apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi

所以，现在你可以很容易地制定出措施的嵌入式语言。还有更多的事情要做这里，特别是对样本空间比实际行以外，随机变量，conditionning之间的依赖关系，但我希望你明白了吧。

So now you can easily work out an embedded language for measures. There are much more things to do here, particularly with respect to sample spaces other than the real line, dependencies between random variables, conditionning, but I hope you get the point.

在具体地，pushforward是函子：

In particular, the pushforward is functorial:

newtype Measure a = (a -> Double) -> Double
instance Functor Measure a where
    fmap f mu = apply f mu

有一个单子太（运动 - 提示：这个非常貌似延续单子是什么返回什么是呼叫/毫升？）。

It is a monad too (exercise -- hint: this very much looks like the continuation monad. What is return ? What is the analog of call/cc ?).

此外，结合一个微分几何架构，这大概可以变成一些东西，自动计算贝叶斯后验分布

Also, combined with a differential geometry framework, this can probably be turned into something which compute Bayesian posterior distributions automatically.

在这一天结束时，你可以写这样的东西

At the end of the day, you can write stuff like

m = mean $ apply cos ((from_pdf gauss) ** (empirical data))

要计算COS（X + Y），其中X具有PDF 高斯和Y被采样由MC方法，其结果是数据。

to compute the mean of cos(X + Y) where X has pdf gauss and Y has been sampled by a MC method whose results are in data.

这篇关于再presenting连续概率分布的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

再presenting连续概率分布 [英] Representing continuous probability distributions

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

再presenting连续概率分布 [英] Representing continuous probability distributions

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭