How to encode a categorical feature with high cardinality?

Question

I'm stuck on a dataset that contains some categorical features with high cardinality, such as 'item_description'. I read about a trick called hashing, but its main idea is still blurry and incomprehensible to me. I also read about a library called Feature-engine, but I didn't really find anything that might solve my issue. Any suggestions, please?

Recommended answer

Options:

i) Use target encoding.

A good tutorial on categorical variables: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models, 3rd video]
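To make target encoding concrete, here is a minimal sketch in plain Python. It replaces each category with a smoothed mean of the target for that category. The function names, the `smoothing` parameter, and the toy data are assumptions made for illustration; they do not come from the tutorial linked above.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean-target value per category (illustrative helper)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean; frequent ones
    # stay close to their own observed mean.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

def transform_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen values fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]

# Toy data (hypothetical): item descriptions and a binary target.
train_items = ["red mug", "red mug", "blue pen", "blue pen", "blue pen", "green cap"]
train_sold = [1, 1, 0, 0, 1, 0]

mapping, prior = fit_target_encoding(train_items, train_sold, smoothing=2.0)
encoded = transform_target_encoding(["red mug", "blue pen", "unseen item"], mapping, prior)
print(encoded)
```

Fit the mapping on training data only (ideally out-of-fold), since computing it on the same rows the model is later evaluated on leaks the target into the feature.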

ii) Use entity embeddings. In short, this technique represents each category by a vector, and the vectors are then trained so that they capture the characteristics of each category.

Notebook implementations:

  1. https://www.kaggle.com/aquatic/entity-embedding-neural-net
  2. https://www.kaggle.com/abhishek/same-old-entity-embeddings
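The idea behind entity embeddings can be sketched without any deep learning framework. The toy below gives each category a small trainable vector, feeds it through a single linear output, and updates both with stochastic gradient descent on squared error. It is only an illustration under assumptions (all names, hyperparameters, and data are hypothetical) and is not the code from the notebooks above, which use neural-network libraries with proper embedding layers.

```python
import random

def train_entity_embeddings(categories, targets, dim=2, epochs=300, lr=0.05, seed=0):
    """Learn one `dim`-sized vector per category with plain SGD (toy sketch)."""
    rng = random.Random(seed)
    vocab = sorted(set(categories))
    # Embedding table: one small random vector per distinct category.
    emb = {c: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for c in vocab}
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # output layer weights
    b = 0.0                                           # output layer bias
    data = list(zip(categories, targets))
    for _ in range(epochs):
        rng.shuffle(data)
        for cat, y in data:
            e = emb[cat]
            err = sum(wi * ei for wi, ei in zip(w, e)) + b - y
            for i in range(dim):
                # Gradients of 0.5 * err**2, computed before either update.
                grad_w, grad_e = err * e[i], err * w[i]
                w[i] -= lr * grad_w
                e[i] -= lr * grad_e
            b -= lr * err
    return emb, w, b

def predict(model, cat):
    """Prediction of the toy model for a known category."""
    emb, w, b = model
    return sum(wi * ei for wi, ei in zip(w, emb[cat])) + b

# Toy data (hypothetical).
items = ["red mug", "red mug", "blue pen", "blue pen", "green cap", "green cap"]
sold = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

model = train_entity_embeddings(items, sold)
for name, vector in model[0].items():
    print(name, vector)
```

After training, the learned vectors (here `model[0]`) can be used as dense numeric features for another model in place of a one-hot encoding.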

iii) Use CatBoost, which can handle categorical features natively.

Extra: the hashing trick might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
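Since the question says the idea of hashing is still blurry, here is a minimal sketch of it. A fixed hash function maps every category string to one of `n_buckets` columns, so the feature width stays constant no matter how many distinct values appear. The helper names and the bucket count are assumptions for illustration and are not taken from the linked article.

```python
import hashlib

def hash_bucket(value, n_buckets=16):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash(), whose output for
    # strings changes between Python processes.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value, n_buckets=16):
    """Return a fixed-length indicator vector for one category value."""
    vector = [0] * n_buckets
    vector[hash_bucket(value, n_buckets)] = 1
    return vector

# Toy data (hypothetical): any number of distinct values fits in 8 columns.
for item in ["red mug", "blue pen", "a value never seen before"]:
    print(item, hash_encode(item, n_buckets=8))
```

The trade-off is collisions: different categories can land in the same bucket, and fewer buckets means more collisions. No vocabulary has to be stored, and unseen values are handled automatically.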
