Best practice for storing weights in a SQL database?


Problem description

An application I'm working on needs to store weights of the format X pounds, y.y ounces. The database is MySQL, but I imagine this is DB agnostic.

I can think of three ways to do this:

  1. Convert the weight to decimal pounds and store in a single field. (5 lbs 6.2 oz = 5.33671875 lbs)
  2. Convert the weight to decimal ounces and store in a single field. (5 lbs 6.2 oz = 86.2 oz)
  3. Store the pounds portion as an integer and the ounces portion as a decimal, in two fields.

I'm thinking that #1 is not such a good idea, since decimal pounds will produce numbers of arbitrary precision, which would need to be stored as a float, which could lead to inaccuracies which are inherent in floating point numbers.

Is there a compelling reason to choose #2 over #3 or vice versa?

Solution

TL;DR

Choose either option #1 or option #2—there's no difference between them. Don't use option #3, because it's awkward to work with.

You claim that there are inherent inaccuracies in floating point numbers. I think that this deserves a little explanation.

When deciding upon a numeral system for representing a number (whether on a piece of paper, in a computer circuit, or elsewhere), there are two separate issues to consider:

  1. its basis; and

  2. its format.

Pick a base, any base…

Limited by finite space, one cannot represent an arbitrary member of an infinite set. For example: no matter how much paper you buy or how small your handwriting, it'd always be possible to find an integer that won't fit in the given space (you could just keep appending extra digits until the paper runs out). So, with integers, we usually restrict our finite space to representing only those that fall within some particular interval—e.g. if we have space for three digits, we might restrict ourselves to the interval [-999,999].

Every non-empty interval contains an infinite set of real numbers. In other words, no matter what interval one takes over the real numbers—be it [-999,999], [0,1], [0.000001,0.000002] or anything else—there is still an infinite set of reals within that interval! Therefore arbitrary real numbers must always be "rounded" to something that can be represented in finite space.

The set of real numbers that can be represented in finite space depends upon the numeral system that is used. In our (familiar) positional base-10 system, finite space will suffice for one-half (0.5₁₀) but not for one-third (0.33333…₁₀); by contrast, in the (less familiar) positional base-9 system, it is the other way around (those same numbers are respectively 0.44444…₉ and 0.3₉). Irrational numbers always require infinite space in standard positional systems. The consequence of all this is that some numbers that can be represented using only a small amount of space in positional base-10 (and therefore appear to be very "round" to us humans) would actually require infinite binary circuits for storage (and therefore don't appear to be very "round" to our digital friends)!
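To make the point about bases concrete, here is a minimal sketch that expands a fraction digit by digit in a chosen base and reports whether the expansion terminates. The helper name `expand` and the 12-digit cut-off are assumptions made for illustration; they are not part of the original answer.

```python
from fractions import Fraction

def expand(value: Fraction, base: int, max_digits: int = 12) -> tuple[str, bool]:
    """Return the fractional digits of 0 <= value < 1 in `base` (base <= 10),
    plus a flag saying whether the expansion terminated within max_digits."""
    digits = []
    remainder = value
    for _ in range(max_digits):
        if remainder == 0:
            return "".join(digits), True
        remainder *= base
        digit = int(remainder)          # integer part is the next digit
        digits.append(str(digit))
        remainder -= digit              # keep only the fractional part
    return "".join(digits), remainder == 0

print(expand(Fraction(1, 2), 10))  # ('5', True)
print(expand(Fraction(1, 3), 10))  # ('333333333333', False)
print(expand(Fraction(1, 2), 9))   # ('444444444444', False)
print(expand(Fraction(1, 3), 9))   # ('3', True)
print(expand(Fraction(1, 5), 2))   # ('001100110011', False)
```

The last line is the fractional part of 86.2: one-fifth is short in base 10 but repeats forever in base 2, which is why it cannot be stored exactly in a binary format.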

We can't do any better for continuous quantities. Ultimately such quantities must use a finite representation in some numeral system: it's arbitrary whether that system happens to be easy on computer circuits, on human fingers, on something else or on nothing at all—whichever system is used, the value must be rounded and therefore it always results in "representation error".

In other words, even if one has a perfectly precise measuring instrument (which is physically impossible), then any measurement it reports will already have been rounded to a number that happens to fit on its display (in whatever base it uses—typically decimal, for obvious reasons). So, "86.2 oz" is never actually "86.2 oz" but rather a representation of "something between 86.1500000... oz and 86.2499999... oz". (Actually, because in reality the instrument is imperfect, all we can ever really say is that we have some degree of confidence that the actual value falls within that interval—but that is definitely departing some way from the point here).

But we can do better for discrete quantities. Such values are not "arbitrary real numbers" and therefore none of the above applies to them: they can be represented exactly in the numeral system in which they were defined—and indeed, should be (as converting to another numeral system and truncating to a finite length would result in rounding to an inexact number). Computers can (inefficiently) handle such situations by representing the number as a string: e.g. consider ASCII or BCD encoding.

Apply a format…

Since it's a property of the numeral system's (somewhat arbitrary) basis, whether or not a value appears to be "round" has no bearing on its precision. That's a really important observation, which runs counter to many people's intuition (and it's the reason I spent so much time explaining numerical basis above).

Precision is instead determined by how many significant figures a representation has. We need a storage format that is capable of recording our values to at least as many significant figures as we consider them to be correct. Taking by way of example values that we consider to be correct when stated as 86.2 and 0.0000862, the two most common options are:

  • Fixed point, where the number of significant figures depends on magnitude: e.g. in fixed 5-decimal-place representation, our values would be stored as 86.20000 and 0.00009 (and therefore have 7 and 1 significant figures of precision respectively). In this example, precision has been lost in the latter value (and indeed, it wouldn't take much more for us to have been totally unable to represent anything of significance); and the former value is stored with false precision, which is a waste of our finite space (and indeed, it wouldn't take much more for the value to become so large that it overflows the storage capacity).

    A common example of when this format might be appropriate is for an accounting system: currency must usually be tracked to the penny irrespective of the monetary sum (therefore less precision is required for small values, but greater precision is required for large values). As it happens, currency is usually also considered to be discrete (pennies are indivisible), so this is also a good example of a situation where a particular basis (decimal for most modern currencies) is desirable to avoid the representation errors discussed above.

    One usually implements fixed point storage by treating one's values as quotients over a common denominator and storing the numerator as an integer. In our example, the common denominator could be 10⁵, so instead of 86.20000 and 0.00009 one would store the integers 8620000 and 9 and remember that they must be divided by 100000.

  • Floating point, where the number of significant figures is constant irrespective of magnitude: e.g. in 5-significant-figure decimal representation, our values would be stored as 86.200 and 0.000086200 (and, by definition, have 5 significant figures of precision both times). In this example, both values have been stored without any loss of precision; and they both also have the same amount of false precision, which is less wasteful (and we can therefore use our finite space to represent a far greater range of values—both large and small).

    A common example of when this format might be appropriate is for recording any real world measurements: the precision of measuring instruments (which all suffer from both systematic and random errors) is fairly constant irrespective of scale so, given sufficient significant figures (typically around 3 or 4 digits), absolutely no precision is lost even if a change of base resulted in rounding to a different number.

    One usually implements floating point storage by treating one's values as integer significands with integer exponents. In our example, the significand could be 86200 for both values, whereupon the (base-10) exponent would be -3 and -9 respectively (86200 × 10⁻³ = 86.2 and 86200 × 10⁻⁹ = 0.0000862).

    But how precise are the floating point storage formats used by our computers?

    • An IEEE754 single precision (binary32) floating point number has 24 bits, or log₁₀(2²⁴) (over 7) digits, of significance; that is, it has a tolerance of less than ±0.000006%. In other words, it is more precise than saying "86.20000".

    • An IEEE754 double precision (binary64) floating point number has 53 bits, or log₁₀(2⁵³) (almost 16) digits, of significance; that is, it has a tolerance of just over ±0.00000000000001%. In other words, it is more precise than saying "86.2000000000000".

    The most important thing to realise is that these formats are, respectively, over ten thousand and over one trillion times more precise than saying "86.2"—even though their representations in binary happen to round to numbers that appear less "exact" in decimal (more on this shortly)!
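The digit counts quoted for binary32 and binary64 can be checked with a few lines of Python. This is only a sketch: the helper `to_binary32` is a hypothetical name, and it relies on the standard `struct` module to round a Python float (itself a binary64) to the nearest binary32 value.

```python
import math
import struct

def to_binary32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Decimal digits of significance carried by each format.
digits32 = math.log10(2 ** 24)   # a little over 7
digits64 = math.log10(2 ** 53)   # a little under 16

# What a binary32 column would actually hold for 86.2.
stored = to_binary32(86.2)
relative_error = abs(stored - 86.2) / 86.2

print(f"{digits32:.2f} {digits64:.2f}")
print(f"{stored:.16f}")
print(relative_error < 2 ** -24)  # True: within the binary32 tolerance
```

The stored value differs from 86.2 only from the eighth significant digit onwards, far beyond the three digits the original measurement claimed.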

Notice also that both fixed and floating point formats will result in loss of precision when a value is known more precisely than the format supports. Such rounding errors can propagate in arithmetic operations to yield apparently erroneous results (which no doubt explains your reference to the "inherent inaccuracies" of floating point numbers): for example, (1/3) × 3000 in 5-place fixed point would yield 999.99000 rather than 1000.00000; and 10/81 − 3/25 in 5-significant-figure floating point would yield 0.0034600 rather than 0.0034568.
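Both rounding examples can be reproduced with Python's `decimal` module, which lets us imitate a 5-place fixed point format (by quantizing) and a 5-significant-figure decimal floating point format (by setting the context precision). The variable names are illustrative assumptions.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 5-place decimal fixed point: every stored value keeps exactly 5 decimals.
PLACES = Decimal("0.00001")
third = (Decimal(1) / Decimal(3)).quantize(PLACES)    # stored as 0.33333
fixed_result = (third * 3000).quantize(PLACES)

# 5-significant-figure decimal floating point.
ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)
a = ctx.divide(Decimal(10), Decimal(81))              # stored as 0.12346
b = ctx.divide(Decimal(3), Decimal(25))               # stored as 0.12
float_result = ctx.subtract(a, b)

# The same difference computed exactly (7/2025), then rounded once at the end.
exact_result = ctx.divide(Decimal(7), Decimal(2025))

print(fixed_result)   # 999.99000
print(float_result)   # 0.00346
print(exact_result)   # 0.0034568
```

Neither format is at fault here: in both cases the inputs were rounded before the arithmetic, and the operation then magnified that rounding.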

The field of numerical analysis is dedicated to understanding these effects, but it is important to realise that any usable system (even performing calculations in your head) is vulnerable to such problems because no method of calculation that is guaranteed to terminate can ever offer infinite precision: consider, for example, how to calculate the area of a circle—there will necessarily be loss of precision in the value used for π, which will propagate into the result.

Conclusion

  1. Real world measurements should use binary floating point: it's fast, compact, extremely precise and no worse than anything else (including the decimal version from which you started). Since MySQL's floating-point datatypes are IEEE754, this is exactly what they offer.

  2. Currency applications should use denary (decimal) fixed point: whilst it's slow and wastes memory, it ensures both that values are not rounded to inexact quantities and that pennies are not lost on large monetary sums. Since MySQL's fixed-point datatypes (DECIMAL and NUMERIC) store values as exact decimal digits rather than as binary fractions, this is exactly what they offer.

Finally, bear in mind that most programming languages represent fractional values using binary floating-point types: so even if your database stores values in another format, they'll probably get converted (with all the ensuing issues that entails) at the interface with your application code.

Which option is best in this case?

Hopefully I've convinced you that your values can safely (and should) be stored in floating point types without worrying about any "inaccuracies". Remember, they're more precise than your flimsy 3-significant-digit decimal representation ever was: you just have to ignore false precision (but one must always do that anyway, even if using a fixed-point decimal format).

As for your question: choose either option 1 or 2 over option 3—it makes comparisons easier (for example, to find the maximum mass, one could just use MAX(mass), whereas to do it efficiently across two columns would require some nesting).
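The difference in query complexity can be sketched as follows. This uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table and column names (`parcel_single`, `parcel_split`, `mass_oz`, `pounds`, `ounces`) are hypothetical; the same SQL statements should carry over to MySQL largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parcel_single (id INTEGER PRIMARY KEY, mass_oz REAL);
    CREATE TABLE parcel_split  (id INTEGER PRIMARY KEY, pounds INTEGER, ounces REAL);
    INSERT INTO parcel_single (mass_oz) VALUES (86.2), (79.9), (82.0);
    INSERT INTO parcel_split (pounds, ounces) VALUES (5, 6.2), (4, 15.9), (5, 2.0);
""")

# Single column (option 1 or 2): one aggregate does the job.
(max_single,) = conn.execute("SELECT MAX(mass_oz) FROM parcel_single").fetchone()

# Two columns (option 3): taking each maximum separately mixes up rows...
naive = conn.execute("SELECT MAX(pounds), MAX(ounces) FROM parcel_split").fetchone()

# ...so the columns have to be ordered (or recombined) explicitly.
max_split = conn.execute(
    "SELECT pounds, ounces FROM parcel_split "
    "ORDER BY pounds DESC, ounces DESC LIMIT 1"
).fetchone()

print(max_single)  # 86.2
print(naive)       # (5, 15.9), a weight that no row actually has
print(max_split)   # (5, 6.2)
```

The ordering query is only one way to handle the split columns; the point is that every comparison, sum or average needs similar special handling once a single quantity is spread over two fields.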

Generally speaking, between those two options it wouldn't much matter which one chooses—floating point numbers are stored with a constant number of significant bits irrespective of their scale (indeed, it could be that some values are rounded to numbers that are closer to their original decimal representation using option 1 whilst simultaneously others are rounded to numbers that are closer to their original decimal representation using option 2: it simply depends how well each particular value can be represented in binary).

In this case, because it happens that there are 16 ounces to 1 pound (and 16 is a power of 2), the relative difference between the original decimal values and the numbers stored using the two approaches is identical:

  1. 5.3875₁₀ (not 5.33671875₁₀ as stated in your question) would be stored in a binary32 float as 101.011000110011001100110₂ (which is 5.38749980926513671875₁₀): this is about 0.0000035% from the original value (but, as discussed above, the "original value" was already a pretty lousy representation of the physical quantity it represents).

    Knowing that a binary32 float stores only 7 decimal digits of precision, our compiler knows for certain that everything from the 8th digit onwards is definitely false precision and therefore must be ignored in every case. Thus, provided that our input value didn't require more precision than that (and if it did, binary32 was obviously the wrong choice of format), this guarantees a return to a decimal value that looks just as round as that from which we started: 5.387500₁₀. However, we should really apply domain knowledge at this point (as we should with any storage format) to discard any further false precision that might exist, such as those two trailing zeroes.

  2. 86.2₁₀ would be stored in a binary32 float as 1010110.00110011001100110₂ (which is 86.1999969482421875₁₀): this is also about 0.0000035% from the original value. As before, we then ignore false precision.

Notice how the binary representations of the numbers are identical, except for the placement of the radix point (which is four bits apart):

101.0110 00110011001100110
101 0110.00110011001100110

This is because 5.3875 × 2⁴ = 86.2.
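The claim that the two stored values differ only in the position of the radix point can be checked by looking at the raw binary32 fields. The sketch below uses the standard `struct` module; the helper names are assumptions for illustration.

```python
import struct

def binary32_bits(x: float) -> str:
    """Return the 32-bit IEEE 754 pattern of x as a string of 0s and 1s."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{as_int:032b}"

def split_fields(bits: str) -> tuple[str, str, str]:
    """Split a binary32 pattern into (sign, exponent, fraction) fields."""
    return bits[0], bits[1:9], bits[9:]

pounds = split_fields(binary32_bits(5.3875))
ounces = split_fields(binary32_bits(86.2))

print(pounds)  # ('0', '10000001', '01011000110011001100110')
print(ounces)  # ('0', '10000101', '01011000110011001100110')
print(int(ounces[1], 2) - int(pounds[1], 2))  # 4
```

The fraction fields are identical and the exponents differ by exactly 4, so both options lose the same relative amount when the value is rounded to binary32.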

As an aside: being European (albeit British), I also have a strong aversion to imperial units of measurement—handling values of different scales is just so messy. I'd almost certainly store masses in SI units (e.g. kilograms or grams) and then perform conversions to imperial units as required within the presentation layer of my application. Plus rigidly adhering to SI units might one day save you from losing $125m.
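If masses were stored in grams, the conversion back to the "X pounds, y.y ounces" format could live in the presentation layer along the following lines. This is a minimal sketch under stated assumptions: the function name `grams_to_lb_oz` is hypothetical, inputs are taken to be non-negative, and the conversion factor follows from the international avoirdupois pound of exactly 453.59237 g.

```python
GRAMS_PER_OUNCE = 28.349523125   # 453.59237 g per pound / 16 oz per pound
OUNCES_PER_POUND = 16

def grams_to_lb_oz(grams: float, oz_decimals: int = 1) -> tuple[int, float]:
    """Convert a non-negative mass in grams to (pounds, ounces) for display.

    The total is rounded to the displayed number of ounce decimals *before*
    splitting, so the result can never read "1 lb 16.0 oz".
    """
    scale = 10 ** oz_decimals
    total_scaled_oz = round(grams / GRAMS_PER_OUNCE * scale)
    pounds, remainder = divmod(total_scaled_oz, OUNCES_PER_POUND * scale)
    return pounds, remainder / scale

print(grams_to_lb_oz(2443.73))  # (5, 6.2)
print(grams_to_lb_oz(907))      # (2, 0.0)
```

Rounding once, at display time, is the same "ignore false precision" step discussed earlier, applied at the point where the value is shown to a person.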
