Preprocess large datafile with categorical and continuous features

Problem description

First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.

As I'm new to Scikit-learn, don't hesitate to offer any advice that could help me improve the process and make it more professional.

My goal is to classify data between two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.

In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:

"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97

I have around 900K lines for training and I run my tests over 100K lines.

As I want to compare several algorithm implementations, I wanted to encode all the nominal values so that they can be used with several classifiers.

I tried several things:

  1. LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
  2. OneHotEncoder: if I understand it well, it is quite perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be LabelEncoded beforehand (a sketch combining this with StandardScaler follows this list).
  3. StandardScaler: this is quite useful, but not for what I need. I decided to integrate it to scale my continuous values.
  4. FeatureHasher: at first I didn't understand what it does. Then I saw that it is mainly used for text analysis. I tried to use it for my problem anyway, cheating by creating a new array containing the result of the transformation. I don't think it was built to work that way, and it wasn't even logical.
  5. DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
  6. partial_fit: this method is provided by only 5 classifiers. I would like to be able to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
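
For items 2 and 3, a minimal sketch of one way to combine them, assuming a recent scikit-learn (0.20 or later) where OneHotEncoder accepts string columns directly and returns a sparse matrix by default; the column lists are placeholders:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal_cols = ["col_0", "col_1", "col_2"]  # placeholders for the 13 nominal columns
continuous_cols = ["col_10", "col_24"]      # placeholders for the continuous columns

preprocess = ColumnTransformer(
    transformers=[
        # sparse one-hot output avoids the dense MemoryError and needs no LabelEncoder step
        ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("continuous", StandardScaler(), continuous_cols),
    ],
    remainder="passthrough",  # keep the already-binarized 0/1 columns as-is
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", Perceptron()),  # any classifier that accepts sparse input
])
# clf.fit(X_train, y_train)   # X_train: a DataFrame containing the columns above

The encoded matrix never has to be densified, so memory grows with the number of non-zero entries rather than with the number of categories.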

I looked through the documentation and found this information on the Preprocessing and Feature Extraction pages.

I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with many categories and limited resources.
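
As a hedged sketch of the hashing trick that item 4 alludes to, FeatureHasher gives exactly that: an unordered encoding with a fixed memory footprint regardless of how many distinct categories appear. The tiny DataFrame below is purely illustrative:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Illustrative frame standing in for the real nominal columns.
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT"],
    "model": ["CLIO III", "208"],
    "fuel": ["Diesel", "Essence"],
})

# Each "column=value" pair is hashed into one of n_features buckets, so the
# output width is fixed and no ordering between categories is implied.
hasher = FeatureHasher(n_features=2**18, input_type="dict")
X_nominal = hasher.transform(df.astype(str).to_dict(orient="records"))
print(X_nominal.shape)  # (2, 262144), stored as a scipy.sparse matrix

The sparse result can be hstacked (scipy.sparse.hstack) with the scaled continuous columns before being fed to a classifier.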

Is there any approach I haven't explored that can fit my needs?

Thanks for any clue or help.

Recommended answer

To convert unordered categorical features you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
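
As a hedged illustration of both suggestions (column names and toy values are placeholders, and the catboost package has to be installed separately):

import pandas as pd

# Tiny illustrative frame; the real data would come from the 900K-line file.
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT", "PEUGEOT"],
    "fuel": ["Diesel", "Essence", "Diesel", "Diesel"],
    "price": [668.77, 702.10, 645.00, 655.30],
    "label": [0, 1, 0, 1],
})
nominal_cols = ["brand", "fuel"]

# Option 1: pandas one-hot encoding; sparse=True stores the dummy columns in a
# sparse format, which keeps memory manageable when there are many categories.
X = pd.get_dummies(df.drop(columns=["label"]), columns=nominal_cols, sparse=True)
print(X.dtypes)

# Option 2: CatBoost consumes the nominal columns directly, no encoding needed.
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df.drop(columns=["label"]), df["label"], cat_features=nominal_cols)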
