What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?


Question

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.

Now, let's assume that we're given a pandas.DataFrame and have no other information about the data in it. What is a good heuristic for determining whether a column in the pandas.DataFrame is categorical?

My initial thoughts are:

1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data

2) If a large enough percentage of the values in the column is unique (e.g., >= 20%), then the column very likely contains continuous data

I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?

Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column held continuous values but did not have many unique values, and the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set. That is the same column type misclassification problem.
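To make the two failure modes concrete, here is a minimal sketch of heuristic 2) applied to a small made-up DataFrame. The helper name `looks_continuous`, the column names, and the data are all hypothetical and only illustrate the problem; the 20% threshold is the one from the question.

```python
import pandas as pd

def looks_continuous(col, unique_frac=0.20):
    """Heuristic 2): many unique values -> continuous (hypothetical helper)."""
    return col.nunique() / col.count() >= unique_frac

# Made-up data, 20 rows, for illustration only
df = pd.DataFrame({
    # continuous measurements recorded on a coarse grid: only 3 distinct values
    "temperature": [20.5, 21.0, 20.5, 21.5, 21.0] * 4,
    # name-like column: every value is unique
    "name": ["passenger_%d" % i for i in range(20)],
})

result = {c: bool(looks_continuous(df[c])) for c in df.columns}
print(result)
```

Under these assumptions the continuous `temperature` column falls below the threshold and is not recognized as continuous, while the all-unique `name` column is labeled continuous, which matches the two misclassifications described in the edit.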

Recommended answer

Here are a couple of approaches:

  1. Find the ratio of the number of unique values to the total number of values. Something like the following:

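# df is assumed to be an existing pandas.DataFrame
likely_cat = {}
for var in df.columns:
    # ratio of distinct non-null values to all non-null values
    likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold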
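As a self-contained illustration, the unique-ratio check can be wrapped in a function and run on toy data. This is only a sketch: the function name `unique_ratio_is_low`, the guard for empty columns, and the example DataFrame are assumptions added here, not part of the answer.

```python
import pandas as pd

def unique_ratio_is_low(df, threshold=0.05):
    """Flag columns whose distinct/non-null ratio is below `threshold`."""
    flags = {}
    for var in df.columns:
        n_values = df[var].count()  # counts non-null values only
        if n_values == 0:
            # assumption: a column with no data is simply not flagged
            flags[var] = False
            continue
        flags[var] = bool(df[var].nunique() / n_values < threshold)
    return flags

# Made-up data, 100 rows
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"] * 25,  # 3 distinct values
    "price": [i * 1.37 for i in range(100)],        # 100 distinct values
})

flags = unique_ratio_is_low(df)
print(flags)
```

Here `color` has a ratio of 3/100 and is flagged as likely categorical, while `price` has a ratio of 1.0 and is not. The threshold is a tuning knob rather than a fixed constant.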

  2. Check whether the top n unique values account for more than a certain proportion of all values:

    top_n = 10
    likely_cat = {}
    for var in df.columns:
        # share of all non-null values covered by the top_n most frequent values
        likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
    

Approach 1) has generally worked better for me than approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categories occur very frequently while a large number of categories occur rarely.
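The long-tailed case can be sketched with a made-up column that has three frequent categories and a hundred one-off values. The helper names `unique_ratio` and `top_n_coverage` and the data are hypothetical; the thresholds are the ones used in the two approaches.

```python
import pandas as pd

def unique_ratio(col):
    """Approach 1) metric: distinct non-null values / non-null values."""
    return col.nunique() / col.count()

def top_n_coverage(col, top_n=10):
    """Approach 2) metric: share of values covered by the top_n most frequent."""
    return col.value_counts(normalize=True).head(top_n).sum()

# Made-up long-tailed column: 3 frequent categories plus 100 one-off values
values = ["a"] * 400 + ["b"] * 300 + ["c"] * 200 + ["rare_%d" % i for i in range(100)]
col = pd.Series(values)

ratio = unique_ratio(col)        # 103 distinct values out of 1000
coverage = top_n_coverage(col)   # "a", "b", "c" plus 7 one-off values out of 1000

approach_1 = bool(ratio < 0.05)      # threshold from approach 1)
approach_2 = bool(coverage > 0.8)    # threshold from approach 2)
print(approach_1, approach_2)
```

With this data the unique ratio is 0.103, so approach 1) does not flag the column, while the top 10 values cover about 90.7% of the rows, so approach 2) does.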
