Python: sort function breaks in the presence of nan


Question

sorted([2, float('nan'), 1]) returns [2, nan, 1]

(At least on the ActiveState Python 3.1 implementation.)

I understand nan is a weird object, so I wouldn't be surprised if it shows up in random places in the sort result. But it also messes up the sort for the non-nan numbers in the container, which is really unexpected.

I asked a related question about max, and based on that I understand why sort works like this. But should this be considered a bug?

Documentation just says "Return a new sorted list [...]" without specifying any details.

EDIT: I now agree that this isn't in violation of the IEEE standard. However, it's a bug from any common sense viewpoint, I think. Even Microsoft, which isn't known to admit their mistakes often, has recognized this one as a bug, and fixed it in the latest version: http://connect.microsoft.com/VisualStudio/feedback/details/363379/bug-in-list-double-sort-in-list-which-contains-double-nan.

Anyway, I ended up following @khachik's answer:

sorted(list_, key=lambda x: float('-inf') if math.isnan(x) else x)

I suspect it results in a performance hit compared to the language doing that by default, but at least it works (barring any bugs that I introduced).
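For reference, the workaround above can be written out as a small runnable sketch. The helper name `nan_first_key` and the sample list are illustrative assumptions, not part of the original answer. Note that a genuine `-inf` in the input would get the same key as NaN, so the two would keep their original relative order.

```python
import math

def nan_first_key(x):
    """Sort key that treats NaN as negative infinity (hypothetical helper name)."""
    return float("-inf") if math.isnan(x) else x

values = [2, float("nan"), 1]
result = sorted(values, key=nan_first_key)

# The NaN ends up at the front; the remaining numbers are in ascending order.
print(result)
```

The key function runs once per element, so the extra cost is linear in the length of the list rather than paid on every comparison.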

Answer

The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, a.k.a. operator <, could be used throughout if and only if less-than defines a suitable ordering over the input values.

But this is specifically NOT true for floating point values and less-than: "NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from the GNU C manual, but it applies to all modern IEEE 754-based floating point.)
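The quoted rule is easy to check in Python itself. This is a minimal sketch; the variable names are illustrative.

```python
nan = float("nan")

# Every ordering or equality comparison involving NaN evaluates to False,
# in both directions, and NaN is not even equal to itself.
comparisons = [
    nan < 1.0,
    nan > 1.0,
    nan == 1.0,
    1.0 < nan,
    1.0 > nan,
    nan == nan,
]
print(comparisons)

# A sort built on < therefore sees "not less than" on both sides of a NaN
# and may treat its neighbours as already being in order.
print(sorted([2, nan, 1]))
```

Because `<` never reports that a NaN is out of place, the values on either side of it may never be compared with each other, which is how the non-NaN numbers can end up unsorted.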

So the possible solutions are:

  1. remove the NaNs first, making the input domain well defined via < (or the other sorting function being used)
  2. define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.

Either approach can be used, in any language.

Practically, considering Python, I would prefer to remove the NaNs if you either don't care much about fastest performance or if removing NaNs is a desired behavior in context.
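A minimal sketch of the remove-first approach, assuming the input is an iterable of ints or floats (the helper name `sorted_without_nan` is hypothetical):

```python
import math

def sorted_without_nan(values):
    """Return the non-NaN values in ascending order (hypothetical helper)."""
    # Once NaNs are filtered out, < is a proper ordering on what remains.
    return sorted(v for v in values if not math.isnan(v))

cleaned = sorted_without_nan([2, float("nan"), 1])
print(cleaned)  # [1, 2]
```

This drops the NaNs from the result entirely, so it only fits when losing them is acceptable.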

Otherwise you could use a suitable predicate function via "cmp" in older Python versions, or wrap it with functools.cmp_to_key() in Python 3, where sort no longer accepts a cmp argument. The latter is a bit more awkward, naturally, than removing the NaNs first. And care will be required to avoid worse performance when defining this predicate function.
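As a sketch of the predicate approach, the comparison function below places NaN after every other number. The name `compare_nan_last` and the choice of "NaN is greatest" are assumptions made for illustration; the opposite convention works just as well.

```python
import math
from functools import cmp_to_key

def compare_nan_last(a, b):
    """Three-way comparison that orders NaN after every other number."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # NaN vs NaN gives 0 (treated as equal); NaN vs number makes NaN greater.
        return a_nan - b_nan
    # Ordinary numbers: -1, 0, or 1 in the usual way.
    return (a > b) - (a < b)

data = [2, float("nan"), 1, float("nan"), -3.5]
result = sorted(data, key=cmp_to_key(compare_nan_last))

# Numbers come out in ascending order, followed by the NaNs.
print(result)
```

On the performance point: a function passed through `functools.cmp_to_key()` is called in Python on every comparison, whereas a plain key function is called once per element, so the key-based variant is usually the cheaper of the two.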
