如何始终如一地对具有变化值的数据帧进行热编码? [英] How to consistently hot encode dataframes with changing values?

查看:48
本文介绍了如何始终如一地对具有变化值的数据帧进行热编码?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我以数据帧的形式获取内容流,每个批次在列中具有不同的值.例如,一批可能看起来像:

I'm getting a stream of content in the form of dataframes, each batch with different values in columns. For example one batch might look like:

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
            'city': ['C', 'B', 'G', 'Z', 'F'], 
            'age': [27, 19, 63, 40, 93]}

还有一个喜欢:

day2_data = {'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]}

如何以返回一致列数的方式对列进行热编码?

how can the columns be hot encoded in a way that returns a consistent number of columns?

如果我在每个批次上使用 pandas 的 get_dummies(),它会返回不同数量的列:

If I use pandas's get_dummies() on each of the batches, it returns a different number of columns:

df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))

len(df1.columns) == len(df2.columns)

我可以获得每列的所有可能值,问题是即使有了这些信息,为每个每日批次生成一个热编码以便列数保持一致的最简单方法是什么?

I can get all the possible values for each column, the question is even with that information what's the simplest way to generate the one hot encoding for each daily batch so the number of columns will be consistent?

推荐答案

好的,因为所有可能的值都是事先知道的.那么下面是一种稍微有点hackish的方法.

Okay since all the possible values are known in advance. Then below is a slightly hackish way of doing it.

import numpy as np
import pandas as pd

# This is a one time process
# Keep all the possible data here in lists
# Can add other categorical variables too which have this type of data
all_possible_states=  ['AL', 'MS', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY']
all_possible_cities= ['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F']

# Declare our transformer class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class MyOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, all_possible_values):
        self.le = LabelEncoder()
        self.ohe = OneHotEncoder()
        self.ohe.fit(self.le.fit_transform(all_possible_values).reshape(-1,1))

    def transform(self, X, y=None):
        return self.ohe.transform(self.le.transform(X).reshape(-1,1)).toarray()

# Allow the transformer to see all the data here
encoders = {}
encoders['state'] = MyOneHotEncoder(all_possible_states)
encoders['city'] = MyOneHotEncoder(all_possible_cities)
# Do this for all categorical columns

# Now this is our method which will be used on the incoming data 
def encode(df):

    tup = (encoders['state'].transform(df['state']), 
           encoders['city'].transform(df['city']),
           # Add all other columns which are not to be transformed
           df[['age']])

    return np.hstack(tup)

# Testing:
day1_data = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
        'city': ['C', 'B', 'G', 'Z', 'F'], 
        'age': [27, 19, 63, 40, 93]})

print(encode(day1_data))
[[  0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.
    0.   0.  27.]
 [  0.   0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   0.   0.   0.
    0.   0.  19.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    1.   0.  63.]
 [  0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   1.  40.]
 [  0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   1.
    0.   0.  93.]]


day2_data = pd.DataFrame({'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]})

print(encode(day2_data))
[[  1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.  42.]
 [  0.   0.   0.   0.   0.   0.   0.   1.   0.   1.   0.   0.   0.   0.
    0.   0.  52.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   1.   0.
    0.   0.  73.]]

请仔细阅读评论,如果仍有任何问题,请询问我.

Do go through the comments and if still any issue, ask me.

这篇关于如何始终如一地对具有变化值的数据帧进行热编码?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆