Preprocess large datafile with categorical and continuous features

Problem description

First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.

As I'm new to Scikit-learn, don't hesitate to offer any advice that could help me improve the process and make it more professional.

My goal is to classify data between two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.

In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:

"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97

I have around 900K lines for training and I run my tests over 100K lines.

As I want to compare several algorithm implementations, I wanted to encode all the nominal values so that they can be used with several classifiers.

I tried several things:

  1. LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
  2. OneHotEncoder: if I understand it well, it is quite perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be LabelEncoded beforehand (a sketch combining this with StandardScaler follows this list).
  3. StandardScaler: this is quite useful, but not for what I need. I decided to integrate it to scale my continuous values.
  4. FeatureHasher: at first I didn't understand what it does. Then I saw that it is mainly used for text analysis. I tried to use it for my problem anyway, cheating by creating a new array containing the result of the transformation. I don't think it was built to work that way, and it wasn't even logical.
  5. DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
  6. partial_fit: this method is provided by only 5 classifiers. I would like to be able to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
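
For items 2 and 3, a minimal sketch of one way to combine them, assuming a recent scikit-learn (0.20 or later) where OneHotEncoder accepts string columns directly and returns a sparse matrix by default; the column lists are placeholders:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal_cols = ["col_0", "col_1", "col_2"]  # placeholders for the 13 nominal columns
continuous_cols = ["col_10", "col_24"]      # placeholders for the continuous columns

preprocess = ColumnTransformer(
    transformers=[
        # sparse one-hot output avoids the dense MemoryError and needs no LabelEncoder step
        ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("continuous", StandardScaler(), continuous_cols),
    ],
    remainder="passthrough",  # keep the already-binarized 0/1 columns as-is
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", Perceptron()),  # any classifier that accepts sparse input
])
# clf.fit(X_train, y_train)   # X_train: a DataFrame containing the columns above

The encoded matrix never has to be densified, so memory grows with the number of non-zero entries rather than with the number of categories.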

I looked through the documentation and found this information on the Preprocessing and Feature Extraction pages.

I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with many categories and limited resources.
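
As a hedged sketch of the hashing trick that item 4 alludes to, FeatureHasher gives exactly that: an unordered encoding with a fixed memory footprint regardless of how many distinct categories appear. The tiny DataFrame below is purely illustrative:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Illustrative frame standing in for the real nominal columns.
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT"],
    "model": ["CLIO III", "208"],
    "fuel": ["Diesel", "Essence"],
})

# Each "column=value" pair is hashed into one of n_features buckets, so the
# output width is fixed and no ordering between categories is implied.
hasher = FeatureHasher(n_features=2**18, input_type="dict")
X_nominal = hasher.transform(df.astype(str).to_dict(orient="records"))
print(X_nominal.shape)  # (2, 262144), stored as a scipy.sparse matrix

The sparse result can be hstacked (scipy.sparse.hstack) with the scaled continuous columns before being fed to a classifier.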

Is there any approach I haven't explored that can fit my needs?

Thanks for any clue or help.

Recommended answer

To convert unordered categorical features you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
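
As a hedged illustration of both suggestions (column names and toy values are placeholders, and the catboost package has to be installed separately):

import pandas as pd

# Tiny illustrative frame; the real data would come from the 900K-line file.
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT", "PEUGEOT"],
    "fuel": ["Diesel", "Essence", "Diesel", "Diesel"],
    "price": [668.77, 702.10, 645.00, 655.30],
    "label": [0, 1, 0, 1],
})
nominal_cols = ["brand", "fuel"]

# Option 1: pandas one-hot encoding; sparse=True stores the dummy columns in a
# sparse format, which keeps memory manageable when there are many categories.
X = pd.get_dummies(df.drop(columns=["label"]), columns=nominal_cols, sparse=True)
print(X.dtypes)

# Option 2: CatBoost consumes the nominal columns directly, no encoding needed.
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df.drop(columns=["label"]), df["label"], cat_features=nominal_cols)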
