Problem with precision floating point operation in C


Question

For one of my course projects I started implementing a "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially for spam) using a huge amount of training data.

Now I have a problem implementing the algorithm because of the limitations of C's floating-point datatypes.

(The algorithm I am using is described here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering )

PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p1, p2, p3, ..., pn are the probabilities of word-1, 2, 3, ..., n, the probability of the document being spam is calculated as

    p = (p1 * p2 * ... * pn) / (p1 * p2 * ... * pn + (1 - p1) * (1 - p2) * ... * (1 - pn))

Here, a probability value can very easily be around 0.01, so even if I use the datatype "double" the product underflows and my calculation goes for a toss. To confirm this I wrote the sample code given below.

#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD   (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD     (0.99)

int main(void)
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating a FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating a LOT of almost certainly spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    /* With plain double (or on platforms where long double is just double),
     * numerator and denom2 underflow to 0, so this prints 0 instead of ~1. */
    doc_spam_prob = numerator / (denom1 + denom2);
    printf("doc_spam_prob = %Lg\n", doc_spam_prob);
    return 0;
}

I tried float, double, and even long double, but the problem remains.

Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still say it is a Not Spam document because of the precision error (the numerator easily goes to ZERO)!

This is the first time I am hitting such an issue. So how exactly should this problem be tackled?

Answer

Your problem is caused by collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation: dividing the numerator and denominator of the original formula by the product of the p_i gives 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small while others are big. If you have too many small terms in a row you'll underflow, and with too many big terms you'll overflow the intermediate result.

So, don't put too many terms of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term and the last the biggest. Now, if you multiplied them straight through you would still underflow, but the order of calculation doesn't matter. Use two iterators into your sorted collection: one starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. When your intermediate result is >= 1.0, take a term from the front; when your intermediate result is < 1.0, take a term from the back.

The result is that as you consume terms, the intermediate result oscillates around 1.0. It only drifts up or down once you run out of small or big terms. But that's OK: at that point you've consumed the extremes on both ends, so the intermediate result slowly approaches the final result.

There is of course a real possibility of overflow. If the input is completely unlikely to be spam (p = 1E-1000), then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) itself overflows. So if the intermediate result overflows, there is no subsequent loss of precision.
