Caveats while checking dtype in pandas DataFrame

Question

Guided by this answer, I started to build a pipeline for processing the columns of a DataFrame based on their dtype. After getting some unexpected output and doing some debugging, I ended up with the following test DataFrame and dtype checks:

import numpy as np
import pandas as pd

# Creating test dataframe
test = pd.DataFrame({'bool' :[False, True], 'int':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['cat'] = test['cat'].astype('category')
test
test.dtypes

# Testing types
types = list(test.columns)
df_types = pd.DataFrame(np.zeros((len(types),len(types)), dtype=bool),
                        index = ['is_'+el for el in types],
                        columns = types)
for col in test.columns:
    df_types.at['is_bool', col] = pd.api.types.is_bool_dtype(test[col])
    df_types.at['is_int' , col] = pd.api.types.is_integer_dtype(test[col])
    df_types.at['is_float',col] = pd.api.types.is_float_dtype(test[col])
    df_types.at['is_compl',col] = pd.api.types.is_complex_dtype(test[col])
    df_types.at['is_dt'  , col] = pd.api.types.is_datetime64_dtype(test[col])
    df_types.at['is_td'  , col] = pd.api.types.is_timedelta64_dtype(test[col])
    df_types.at['is_prd' , col] = pd.api.types.is_period_dtype(test[col])
    df_types.at['is_intrv',col] = pd.api.types.is_interval_dtype(test[col])
    df_types.at['is_str' , col] = pd.api.types.is_string_dtype(test[col])
    df_types.at['is_cat' , col] = pd.api.types.is_categorical_dtype(test[col])
    df_types.at['is_obj' , col] = pd.api.types.is_object_dtype(test[col])

# Styling func
def coloring(df):
    clr_g = 'color : green'
    clr_r = 'color : red'
    mask = ~np.logical_xor(df.values, np.eye(df.shape[0], dtype=bool))
    # OUTPUT
    return pd.DataFrame(np.where(mask, clr_g, clr_r),
                        index = df.index,
                        columns = df.columns)

# OUTPUT colored
df_types.style.apply(coloring, axis=None)

Output of test.dtypes:

bool                  bool
int                  int64
float              float64
compl           complex128
dt          datetime64[ns]
td         timedelta64[ns]
prd              period[D]
intrv    interval[float64]
str                 object
cat               category
obj                 object

Almost everything works as expected, but this test code raises two questions:

  1. The strangest result is that pd.api.types.is_string_dtype fires on the category dtype. Why is that? Should it be treated as expected behavior?
  2. Why do is_string_dtype and is_object_dtype fire on each other's columns? This is somewhat expected, because even in .dtypes both columns are reported as object, but it would help if someone explained it step by step.

P.S. Bonus question: am I right in thinking that pandas has internal tests that must pass before a new release is built (something like df_types from the test code, but recording information about errors rather than coloring cells red)?

pandas version 0.24.2.

Recommended answer

This comes down to is_string_dtype being a fairly loose check; the implementation even has a TODO note to make it stricter, linking to Issue #15585.

The check is not strict because pandas, as of this version, has no dedicated string dtype. Strings are simply stored with object dtype, which can hold anything. A stricter check would therefore likely introduce a performance overhead.
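As a rough illustration of that trade-off, here is a minimal sketch of a stricter check. The helper name is_strictly_string is hypothetical and not part of pandas; it assumes that pd.api.types.infer_dtype, which inspects the actual values, is an acceptable way to tell strings apart from other objects. The columns are created with dtype=object explicitly so the example does not depend on how a given pandas version stores strings by default.

```python
import pandas as pd


def is_strictly_string(series):
    """Hypothetical helper: True only if an object column holds only strings."""
    # Cheap metadata check first: anything that is not object dtype is rejected.
    if not pd.api.types.is_object_dtype(series):
        return False
    # infer_dtype looks at the values themselves, which is the extra cost
    # a strict check has to pay compared with reading dtype metadata.
    return pd.api.types.infer_dtype(series, skipna=True) == "string"


strings = pd.Series(["s1", "s2"], dtype=object)
lists = pd.Series([[1, 2, 3], [5435, 35, -52, 14]], dtype=object)
category = pd.Series([1, -1]).astype("category")

print(is_strictly_string(strings))
print(is_strictly_string(lists))
print(is_strictly_string(category))
```

Under these assumptions the helper accepts only the string column, at the price of scanning the values instead of just reading the dtype.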

To answer your questions:

  1. This is a result of CategoricalDtype.kind being set to 'O', which is one of the loose checks that is_string_dtype performs. Given the TODO note, this could change in the future, so it's not something I'd rely on.

  2. Since strings are stored as object dtype, it makes sense for is_object_dtype to fire on strings, and I'd consider this behavior reliable, as the implementation will almost certainly not change in the immediate future. The reverse, is_string_dtype firing on object columns, happens because is_string_dtype relies on dtype.kind, so it carries the same caveat as the categorical case described above.

  3. Yes, pandas has a test suite that runs automatically on various CI services for every PR that's created. The test suite includes checks similar to what you're doing.
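To see what the first two points refer to, the kind code can be read straight off a dtype. This is only an inspection sketch, not the code pandas uses internally, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# dtype.kind is a one-character code describing the general type family.
object_kind = np.dtype(object).kind
categorical_kind = pd.Series([1, -1]).astype("category").dtype.kind

print(object_kind, categorical_kind)

# A check that looks only at kind cannot tell these two dtypes apart.
same_kind = object_kind == categorical_kind
print(same_kind)
```

Because both dtypes report the same kind, any check built on kind alone will treat a categorical column and an object column alike, whatever values they actually hold.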

One tangentially related note: there is a library called fletcher that uses Apache Arrow to implement a more native string type in a way that's compatible with pandas. It's still under development and probably doesn't yet support all the string operations that pandas does.
