What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

Views: 85 | Published: 2020/5/24 1:45:46 | Tags: python, pandas, scikit-learn


Problem description

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.

Now, let's assume that we're given a pandas.DataFrame and have no other information about the data in it. What is a good heuristic for determining whether a column in the pandas.DataFrame is categorical?

My initial thoughts were:

1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data

2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data

I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?

Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column held continuous values but had few unique values, so the heuristic in 2) labeled it categorical. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set, so 2) labeled it continuous. Both are the same column type misclassification problem.
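The two failure modes described in the edit can be reproduced with a small sketch. The function name `rule2_says_continuous`, the column names, and the data are all made up for illustration; only the 20% threshold comes from the question.

```python
import pandas as pd

def rule2_says_continuous(col: pd.Series, threshold: float = 0.20) -> bool:
    """Rule 2 from the question: a high share of unique values means continuous."""
    return bool(col.nunique() / col.count() >= threshold)

# Made-up data: 20 rows, chosen only to trigger both failure modes.
df = pd.DataFrame({
    "dose_mg": [0.5, 1.5] * 10,                     # continuous quantity, only 2 distinct values
    "name": [f"passenger_{i}" for i in range(20)],  # labels, every value distinct
})

result = {name: rule2_says_continuous(df[name]) for name in df.columns}
print(result)  # {'dose_mg': False, 'name': True}
```

Here the continuous `dose_mg` column is not flagged as continuous (2/20 = 0.10), while the label-like `name` column is (20/20 = 1.0).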

Recommended answer

Here are a couple of approaches:

1) Find the ratio of the number of unique values to the total number of values. Something like the following:

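# df is assumed to be an existing pandas.DataFrame
likely_cat = {}
for var in df.columns:
    # Ratio of distinct non-null values to all non-null values
    likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold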
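As a rough illustration of the unique-value ratio, here is a self-contained sketch on a toy DataFrame. The function name `likely_categorical_by_ratio`, the column names, and the data are hypothetical, and the handling of all-null columns is an assumption that is not part of the answer.

```python
import pandas as pd

def likely_categorical_by_ratio(df: pd.DataFrame, threshold: float = 0.05) -> dict:
    """Flag columns whose unique-to-total ratio falls below the threshold."""
    flags = {}
    for name in df.columns:
        n_values = df[name].count()  # number of non-null values
        if n_values == 0:
            flags[name] = False      # assumption: leave all-null columns unflagged
            continue
        flags[name] = bool(df[name].nunique() / n_values < threshold)
    return flags

# Toy data with 100 rows.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"] * 25,      # 3 distinct values
    "height_cm": [150 + 0.37 * i for i in range(100)],  # 100 distinct values
})

flags = likely_categorical_by_ratio(df)
print(flags)  # {'color': True, 'height_cm': False}
```

With 100 rows, `color` has a ratio of 0.03 and is flagged, while `height_cm` has a ratio of 1.0 and is not.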

2) Check if the top n unique values account for more than a certain proportion of all values

top_n = 10
likely_cat = {}
for var in df.columns:
    # Share of all non-null values covered by the top_n most frequent values
    likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
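The top-n check can be tried on a toy DataFrame in the same way. This is only a sketch: the function name `likely_categorical_by_top_n`, the column names, and the data are invented for the example, while `top_n = 10` and the 0.8 threshold come from the snippet above.

```python
import pandas as pd

def likely_categorical_by_top_n(df: pd.DataFrame, top_n: int = 10,
                                threshold: float = 0.8) -> dict:
    """Flag columns where the top_n most frequent values cover most of the data."""
    flags = {}
    for name in df.columns:
        # value_counts sorts by frequency (descending) and ignores nulls by default
        top_share = df[name].value_counts(normalize=True).head(top_n).sum()
        flags[name] = bool(top_share > threshold)
    return flags

# Toy data with 100 rows.
df = pd.DataFrame({
    "grade": ["A", "B", "C", "D", "E"] * 20,            # 5 values cover everything
    "height_cm": [150 + 0.37 * i for i in range(100)],  # top 10 values cover only 10%
})

flags = likely_categorical_by_top_n(df)
print(flags)  # {'grade': True, 'height_cm': False}
```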

Approach 1) has generally worked better for me than approach 2). But approach 2) is better when a column has a 'long-tailed distribution', where a small number of categories occur with high frequency while a large number of categories occur with low frequency.
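A small sketch of the long-tailed case, using invented data: three frequent categories plus ten values that each appear once. The column name `city` and its values are hypothetical; the thresholds are the ones used in the two approaches.

```python
import pandas as pd

# Made-up long-tailed column: 3 frequent values (90 rows) and 10 one-off values.
city = pd.Series(
    ["London"] * 40 + ["Paris"] * 30 + ["Berlin"] * 20
    + [f"town_{i}" for i in range(10)],
    name="city",
)

# Approach 1: unique-value ratio, 13 / 100 = 0.13, which is not below 0.05.
by_ratio = bool(city.nunique() / city.count() < 0.05)

# Approach 2: share covered by the 10 most frequent values,
# 0.40 + 0.30 + 0.20 + 7 * 0.01 = 0.97, which is above 0.8.
by_top_n = bool(city.value_counts(normalize=True).head(10).sum() > 0.8)

print(by_ratio, by_top_n)  # False True
```

On this column the ratio check misses the categorical structure because of the rare values in the tail, while the top-n check still picks it up.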
