Scikit-learn LabelEncoder: how to preserve mappings between batches?
Question
I have 185 million samples, at about 3.8 MB per sample. To prepare my dataset I need to one-hot encode many of the features, after which I end up with over 15,000 features.
However, I need to prepare the dataset in batches: one-hot encoding just 3 million samples already gives a memory footprint of more than 100 GB for the features alone.
The question is how to preserve the encodings/mappings/labels between batches. A batch will not necessarily contain every level of a category. For example, batch #1 may contain Paris, Tokyo, Rome, while batch #2 may contain Paris, London. In the end, though, I need Paris, Tokyo, Rome, and London all mapped to one consistent encoding.
Assuming I cannot determine the levels of my Cities column across all 185 million rows at once because it won't fit in RAM, what should I do? If I apply the same LabelEncoder instance to different batches, will the mappings remain the same? After this I will also need to one-hot encode in batches, using either scikit-learn or Keras' np_utils.to_categorical. So the same question applies: how can I use these three methods in batches, or apply them all at once to a file format stored on disk?
Recommended answer
I suggest using pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values at .fit() time; otherwise, with its default settings, it throws an error when it encounters a new value during .transform().
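The OneHotEncoder behaviour mentioned here can be checked with a short sketch. The two toy batches below are assumptions made up for illustration (they mirror the cities in the question). The sketch also shows the encoder's handle_unknown="ignore" option, which encodes an unseen value as an all-zero row instead of raising; whether that is acceptable depends on the model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy batches (illustrative data, not from the original dataset)
batch_1 = np.array([["Paris"], ["Tokyo"], ["Rome"]], dtype=object)
batch_2 = np.array([["Paris"], ["London"]], dtype=object)

# Default encoder: handle_unknown="error"
strict = OneHotEncoder()
strict.fit(batch_1)
try:
    strict.transform(batch_2)  # "London" was not seen during fit
    unseen_raises = False
except ValueError:
    unseen_raises = True

# Lenient encoder: unseen categories become all-zero rows
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(batch_1)
encoded_2 = lenient.transform(batch_2).toarray()

print(unseen_raises)
print(lenient.categories_)
print(encoded_2)
```

The get_dummies() approach suggested in this answer follows.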
import pandas as pd

# Create a toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]
# Convert categorical feature column to matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')
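# Row-bind (append) the encoded batches; a column missing from a batch is filled with NaN.
# sort=True orders the columns alphabetically (recent pandas versions do not sort by default).
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0, sort=True)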
# Final wrap-up: replace NaNs with 0 and convert the flags to int
final_encoded = final_encoded.fillna(0).astype(int)
final_encoded
Output
   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0
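Concatenating every encoded batch still requires the full result to fit in memory, which the question rules out. One possible alternative, sketched below, is a two-pass scheme: first stream over the batches to collect the distinct levels (the set of city names is small even if the column is not), then encode each batch against that fixed list so every batch gets identical columns. The iter_batches() helper is hypothetical and stands in for whatever chunked reader is used, for example pd.read_csv with chunksize.

```python
import pandas as pd

def iter_batches():
    """Hypothetical stand-in for a chunked reader over data on disk."""
    yield pd.Series(["Paris", "Tokyo", "Rome"])
    yield pd.Series(["Paris", "London"])

# Pass 1: collect the distinct levels across all batches
levels = set()
for batch in iter_batches():
    levels.update(batch.unique())
categories = sorted(levels)

# Pass 2: encode each batch against the same fixed category list
city_dtype = pd.CategoricalDtype(categories=categories)

def encode(batch):
    # Levels absent from this batch still get an all-zero column
    return pd.get_dummies(batch.astype(city_dtype), prefix="City", dtype=int)

encoded_batches = [encode(batch) for batch in iter_batches()]
for encoded in encoded_batches:
    print(encoded)
```

Because each encoded batch has the same columns in the same order, the batches can be written to disk or fed to a model one at a time without ever being concatenated.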