Why is a sum of strings converted to floats?


Problem description

Consider the following dataframe (note the strings):

import pandas as pd

df = pd.DataFrame([['3', '11'], ['0', '2']], columns=list('AB'))
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
A    2 non-null object
B    2 non-null object
dtypes: object(2)
memory usage: 104.0+ bytes

Question

I'm going to sum the columns, and I expect the strings to be concatenated.

df.sum()

A     30.0
B    112.0
dtype: float64

It looks as though the strings were concatenated and then converted to float. Is there a good reason for this? Is this a bug? Anything enlightening will be upvoted.
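To make the observation concrete, here is a minimal pure-Python sketch of the behaviour the output suggests: the strings in each column are added, which concatenates them, and the result is then parsed as a float. The columns dict and the variable names are illustrative only, and this is not pandas' actual code path.

```python
from functools import reduce
import operator

# Column values as they appear in the dataframe (all strings).
columns = {'A': ['3', '0'], 'B': ['11', '2']}

# "Summing" strings with + concatenates them: '3' + '0' == '30'.
concatenated = {name: reduce(operator.add, values)
                for name, values in columns.items()}

# Parsing the concatenated text as a float reproduces the surprising output.
as_float = {name: float(text) for name, text in concatenated.items()}

print(concatenated)  # {'A': '30', 'B': '112'}
print(as_float)      # {'A': 30.0, 'B': 112.0}
```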

Recommended answer

I went with the good old stack trace, and learned a bit about pdb through PyCharm along the way. It turns out that the following happens:

1)

cls.sum = _make_stat_function(
            'sum', name, name2, axis_descr,
            'Return the sum of the values for the requested axis',
            nanops.nansum)

Let's look at _make_stat_function.

2)

def _make_stat_function(name, name1, name2, axis_descr, desc, f):
    @Substitution(outname=name, desc=desc, name1=name1, name2=name2,
                  axis_descr=axis_descr)
    @Appender(_num_doc)
    def stat_func(self, axis=None, skipna=None, level=None, numeric_only=None,
                  **kwargs):
        _validate_kwargs(name, kwargs, 'out', 'dtype')

        if skipna is None:
            skipna = True
        if axis is None:
            axis = self._stat_axis_number
        if level is not None:
            return self._agg_by_level(name, axis=axis, level=level,
                                      skipna=skipna)
        return self._reduce(f, name, axis=axis, skipna=skipna,
                            numeric_only=numeric_only)

The last line is key. It's kind of funny, as there are about 7 different _reduce methods within pandas.core. pdb says the one being called is in pandas.core.frame. Let's take a look.
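The pattern in steps 1 and 2 is a function factory: a helper builds a method that fills in defaults and hands the real work to self._reduce, and the result is attached to the class. Below is a rough, stripped-down sketch of that pattern. ToyFrame, make_stat_function and the dict-of-lists storage are hypothetical stand-ins, not pandas internals.

```python
def make_stat_function(name, reducer):
    """Build a method that forwards to self._reduce, like the factory above."""
    def stat_func(self, axis=None, skipna=None):
        # Fill in defaults before delegating, as the pandas version does.
        if skipna is None:
            skipna = True
        if axis is None:
            axis = 0
        return self._reduce(reducer, name, axis=axis, skipna=skipna)
    stat_func.__name__ = name
    return stat_func


class ToyFrame:
    """Hypothetical stand-in for a DataFrame, holding a dict of columns."""
    def __init__(self, columns):
        self.columns = columns

    def _reduce(self, op, name, axis=0, skipna=True):
        # Apply the reducer to each column; axis and skipna are ignored here.
        return {label: op(values) for label, values in self.columns.items()}


# Attach the generated method to the class, mirroring "cls.sum = ...".
ToyFrame.sum = make_stat_function('sum', sum)

frame = ToyFrame({'A': [3, 0], 'B': [11, 2]})
print(frame.sum())  # {'A': 3, 'B': 13}
```

The point of the sketch is only that sum itself contains no summing logic; whatever _reduce does with the reducer's result decides the final dtype.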

3)

def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None,
            filter_type=None, **kwds):
    axis = self._get_axis_number(axis)

    def f(x):
        return op(x, axis=axis, skipna=skipna, **kwds)

    labels = self._get_agg_axis(axis)

    # exclude timedelta/datetime unless we are uniform types
    if axis == 1 and self._is_mixed_type and self._is_datelike_mixed_type:
        numeric_only = True

    if numeric_only is None:
        try:
            values = self.values
            result = f(values)
        except Exception as e:

            # try by-column first
            if filter_type is None and axis == 0:
                try:

                    # this can end up with a non-reduction
                    # but not always. if the types are mixed
                    # with datelike then need to make sure a series
                    result = self.apply(f, reduce=False)
                    if result.ndim == self.ndim:
                        result = result.iloc[0]
                    return result
                except:
                    pass

            if filter_type is None or filter_type == 'numeric':
                data = self._get_numeric_data()
            elif filter_type == 'bool':
                data = self._get_bool_data()
            else:  # pragma: no cover
                e = NotImplementedError("Handling exception with filter_"
                                        "type %s not implemented." %
                                        filter_type)
                raise_with_traceback(e)
            result = f(data.values)
            labels = data._get_agg_axis(axis)
    else:
        if numeric_only:
            if filter_type is None or filter_type == 'numeric':
                data = self._get_numeric_data()
            elif filter_type == 'bool':
                data = self._get_bool_data()
            else:  # pragma: no cover
                msg = ("Generating numeric_only data with filter_type %s"
                       "not supported." % filter_type)
                raise NotImplementedError(msg)
            values = data.values
            labels = data._get_agg_axis(axis)
        else:
            values = self.values
        result = f(values)

    if hasattr(result, 'dtype') and is_object_dtype(result.dtype):
        try:
            if filter_type is None or filter_type == 'numeric':
                result = result.astype(np.float64)
            elif filter_type == 'bool' and notnull(result).all():
                result = result.astype(np.bool_)
        except (ValueError, TypeError):

            # try to coerce to the original dtypes item by item if we can
            if axis == 0:
                result = com._coerce_to_dtypes(result, self.dtypes)

    return Series(result, index=labels)

Holy smokes, talk about an out-of-control function. Someone needs to refactor this! Let's zoom in on the trouble lines:

if hasattr(result, 'dtype') and is_object_dtype(result.dtype):
    try:
        if filter_type is None or filter_type == 'numeric':
            result = result.astype(np.float64)

And you'd better believe that last line gets executed. Here's some of the pdb trace:

> c:\users\matthew\anaconda2\lib\site-packages\pandas\core\frame.py(4801)_reduce()
-> result = result.astype(np.float64)
(Pdb) l
4796                result = f(values)
4797    
4798            if hasattr(result, 'dtype') and is_object_dtype(result.dtype):
4799                try:
4800                    if filter_type is None or filter_type == 'numeric':
4801 ->                     result = result.astype(np.float64)
4802                    elif filter_type == 'bool' and notnull(result).all():
4803                        result = result.astype(np.bool_)
4804                except (ValueError, TypeError):
4805    
4806                    # try to coerce to the original dtypes item by item if we can
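The effect of that try/except can be modelled without pandas. The sketch below is a simplification under stated assumptions: the built-in float stands in for result.astype(np.float64), and the fallback just returns the input unchanged instead of calling com._coerce_to_dtypes. The function name coerce_object_result is made up for illustration.

```python
def coerce_object_result(result):
    """Toy model of the final step: try float first, otherwise keep as-is."""
    try:
        return [float(value) for value in result]
    except (ValueError, TypeError):
        # The cast failed, so the original objects are kept.
        return result


# Concatenated digit strings parse cleanly, so the cast silently succeeds.
print(coerce_object_result(['30', '112']))  # [30.0, 112.0]

# Non-numeric strings make float() raise, so they survive untouched.
print(coerce_object_result(['ab', 'cd']))   # ['ab', 'cd']
```

This is why the conversion only shows up when the concatenated strings happen to look like numbers.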

If you're a non-believer, open up pandas/core/frame.py and put a print "OI" right above line 4801. It should splat out to the console :). Note that I'm on Anaconda 2, Windows.

To answer your question, I'm going to go with "bug".
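Until that is sorted out, one way to sidestep the implicit cast is to say explicitly which result is wanted. The following is a small sketch, not something from the original trace: it joins the strings per column with apply, or converts the columns to integers before summing. The names joined and totals are just illustrative.

```python
import pandas as pd

df = pd.DataFrame([['3', '11'], ['0', '2']], columns=list('AB'))

# Concatenate the strings in each column explicitly.
joined = df.apply(''.join)

# Or convert to integers first if a numeric total is what is wanted.
totals = df.astype(int).sum()

print(joined['A'], joined['B'])  # 30 112
print(totals['A'], totals['B'])  # 3 13
```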
