Understanding pandas.read_csv() float parsing


Question


I am having problems reading probabilities from a CSV file using pandas.read_csv; some of the values are read as floats strictly greater than 1.0.

Specifically, I am confused about the following behavior:

>>> import io
>>> import pandas
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
1.0000000000000002
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
1.0000000000000002

The default float-parsing behavior seems to be non-monotonic; in particular, some values starting with 0.9... are converted to floats strictly greater than 1.0, which causes problems, e.g. when feeding them into sklearn.metrics.

The documentation states that read_csv has a parameter float_precision that can be used to select "which converter the C engine should use for floating-point values", and setting this to 'high' indeed solves my problem.
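
For reference, here is a minimal sketch of the three float_precision settings the C engine accepts (None, 'high' and 'round_trip'); the expected outputs are based on the behavior shown above and on the built-in float():

import io
import pandas

data = "column\n0.99999999999999999"

# Default fast parser (xstrtod): can round up past 1.0.
print(pandas.read_csv(io.StringIO(data))["column"][0])
# 1.0000000000000002

# Higher-precision parser:
print(pandas.read_csv(io.StringIO(data), float_precision="high")["column"][0])
# 1.0

# Round-trip parser, matching Python's built-in float():
print(pandas.read_csv(io.StringIO(data), float_precision="round_trip")["column"][0])
# 1.0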

However, I would like to understand the default behavior:

  1. Where can I find the source code of the default float converter?
  2. Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
  3. Why does a single-figure change in the least significant position skip a value?
  4. Why does this behave non-monotonically at all?

Edit regarding "duplicate question": This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the built-in float does not show this behavior:

>>> float("0.99999999999999999")
1.0

...and I could not find documentation.

Solution

@MaxU already showed the source code for the parser and the relevant tokenizer, xstrtod, so I'll focus on the "why" part:

The code for xstrtod is roughly like this (translated to pure Python):

def xstrtod(p):
    """Simplified pure-Python translation of pandas' xstrtod (the default C float parser)."""
    number = 0.
    idx = 0
    ndecimals = 0

    # accumulate the digits before the decimal point
    while p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1

    # skip the decimal point
    idx += 1

    # accumulate the digits after the decimal point, counting them
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1

    # a single division at the very end
    return number / 10**ndecimals

This reproduces the "problem" you saw:

print(xstrtod('0.99999999999999997'))  # 1.0
print(xstrtod('0.99999999999999998'))  # 1.0
print(xstrtod('0.99999999999999999'))  # 1.0000000000000002
print(xstrtod('1.00000000000000000'))  # 1.0
print(xstrtod('1.00000000000000001'))  # 1.0
print(xstrtod('1.00000000000000002'))  # 1.0
print(xstrtod('1.00000000000000003'))  # 1.0
print(xstrtod('1.00000000000000004'))  # 1.0
print(xstrtod('1.00000000000000005'))  # 1.0
print(xstrtod('1.00000000000000006'))  # 1.0
print(xstrtod('1.00000000000000007'))  # 1.0
print(xstrtod('1.00000000000000008'))  # 1.0
print(xstrtod('1.00000000000000009'))  # 1.0000000000000002
print(xstrtod('1.00000000000000019'))  # 1.0000000000000002

The problem seems to be the 9 in the last place, which alters the result. It comes down to floating-point accuracy: because xstrtod accumulates all the digits into one big float before the final division, parsing "1.00000000000000009" means first rounding the 18-digit integer 100000000000000009 to a double:

>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17

It's the 9 in the last place that pushes the accumulated integer up to the next representable double; dividing that by 10**17 then gives 1.0000000000000002 instead of 1.0.
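
To make the connection to the original question explicit, here is the same digit-by-digit accumulation done by hand for "0.99999999999999999" (a small sketch, not pandas code):

# Accumulate the 17 fractional digits the way xstrtod does:
n = 0.0
for digit in "99999999999999999":
    n = n * 10.0 + int(digit)

print(n)           # 1.0000000000000002e+17  -- already rounded up
print(n / 10**17)  # 1.0000000000000002      -- hence the value > 1.0

# Python's built-in float() instead rounds the decimal string directly
# to the nearest double, which is 1.0:
print(float("0.99999999999999999"))  # 1.0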


If you want high precision you can define your own converters or use Python-provided ones, e.g. decimal.Decimal if you want arbitrary precision:

>>> import io
>>> import decimal
>>> import pandas as pd
>>> converter = {0: decimal.Decimal}  # parse column 0 as decimals
>>> def parse(string):
...     return '{:.30f}'.format(pd.read_csv(io.StringIO(string), converters=converter)["column"][0])
...
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))

which prints:

0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000

Exactly representing the input!
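
One trade-off worth noting (a small sketch using the same converter idea as above): a column parsed with decimal.Decimal ends up with dtype object, so numeric operations on it go through Python-level Decimal arithmetic rather than vectorized float64 operations.

import io
import decimal
import pandas as pd

df = pd.read_csv(io.StringIO("column\n0.99999999999999999"),
                 converters={"column": decimal.Decimal})

print(df["column"].dtype)  # object -- Decimal values, not float64
print(df["column"][0])     # 0.99999999999999999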
