How to Merge Numerical and Embedding Sequential Models to treat categories in RNN


Question

I would like to build a one layer LSTM model with embeddings for my categorical features. I currently have numerical features and a few categorical features, such as Location, which can't be one-hot encoded e.g. using pd.get_dummies() due to computational complexity, which is what I originally intended to do.

Let's imagine an example:

data = {
    'user_id': [1,1,1,1,2,2,3],
    'time_on_page': [10,20,30,20,15,10,40],
    'location': ['London','New York', 'London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5,4,2,1,6,8,2]
}
d = pd.DataFrame(data=data)
print(d)
   user_id  time_on_page   location  page_id
0        1            10     London        5
1        1            20   New York        4
2        1            30     London        2
3        1            20   New York        1
4        2            15  Hong Kong        6
5        2            10      Tokyo        8
6        3            40     Madrid        2

Let's look at the person visiting a website. I'm tracking numerical data such as time on page and others. Categorical data includes: Location (over 1000 uniques), Page_id (> 1000 uniques), Author_id (100+ uniques). The simplest solution would be to one-hot encoding everything and put this into LSTM with variable sequence lengths, each timestep corresponding to a different page view.
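To make the "computational complexity" point concrete, here is a rough back-of-envelope comparison. The vocabulary sizes are the ones quoted above; the embedding dimensions are illustrative assumptions:

```python
# Width added to each timestep's feature vector by the categorical features.
n_location, n_page_id, n_author_id = 1000, 1000, 100  # unique counts quoted above

onehot_width = n_location + n_page_id + n_author_id
print(onehot_width)  # 2100 extra columns per timestep with one-hot encoding

# With modest (assumed) embedding dimensions, the same information
# fits in far fewer columns per timestep.
embed_width = 50 + 50 + 10
print(embed_width)  # 110 columns per timestep
```

This is why embeddings are attractive here: the width grows with the chosen embedding dimensions, not with the number of unique categories.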

The above DataFrame will generate 7 training samples, with variable sequence lengths. For example, for user_id=2 I will have 2 training samples:

[ ROW_INDEX_4 ] and [ ROW_INDEX_4, ROW_INDEX_5 ]
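The prefix-per-user sampling described above can be sketched with pandas (using the example DataFrame from the question):

```python
import pandas as pd

data = {
    'user_id': [1, 1, 1, 1, 2, 2, 3],
    'time_on_page': [10, 20, 30, 20, 15, 10, 40],
    'location': ['London', 'New York', 'London', 'New York',
                 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5, 4, 2, 1, 6, 8, 2],
}
d = pd.DataFrame(data)

# One training sample per prefix of each user's browsing history.
samples = []
for _, group in d.groupby('user_id', sort=False):
    for end in range(1, len(group) + 1):
        samples.append(group.index[:end].tolist())

print(len(samples))   # 7 samples for the example DataFrame
print(samples[4:6])   # user_id=2 yields [[4], [4, 5]], as described above
```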

Let X be the training data, and let's look at the first training sample X[0].

From the picture above, my categorical features are X[0][:, n:].

Before creating sequences, I factorized the categorical variables into [0,1... number_of_cats-1], using pd.factorize() so the data in X[0][:, n:] is numbers corresponding to their index.
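For reference, pd.factorize applied to the example's location column behaves like this:

```python
import pandas as pd

locations = pd.Series(['London', 'New York', 'London', 'New York',
                       'Hong Kong', 'Tokyo', 'Madrid'])
codes, uniques = pd.factorize(locations)
print(codes)           # [0 1 0 1 2 3 4] -- one integer index per category
print(list(uniques))   # ['London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid']
```

Each category gets an index in [0, number_of_cats - 1], in order of first appearance, which is exactly the format an embedding layer expects as input.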

Do I need to create an Embedding for each of the Categorical Features separately? E.g. an embedding for each of x_*n, x_*n+1, ..., x_*m?

If so, how do I put this into Keras code?

model = Sequential()

model.add(Embedding(?, ?, input_length=variable)) # How do I feed the data into this embedding? Only the categorical inputs.

model.add(LSTM())
model.add(Dense())
model.add(Activation('sigmoid'))
model.compile()

model.fit_generator() # fits the `X[i]` one by one of variable length sequences.

My idea of a solution:

I can train a Word2Vec model on every single categorical feature (m-n) to vectorise any given value. E.g. London will be vectorised in 3 dimensions. Let's suppose I use 3 dimensional embeddings. Then I will put everything back into the X matrix, which will now have n + 3(n-m), and use the LSTM model to train it?

I just think there should be an easier/smarter way.

Answer

One solution, as you mentioned, is to one-hot encode the categorical data (or even use them as they are, in index-based format) and feed them along the numerical data to an LSTM layer. Of course, you can also have two LSTM layers here, one for processing the numerical data and another for processing categorical data (in one-hot encoded format or index-based format) and then merge their outputs.

Another solution is to have one separate embedding layer for each of those categorical data. Each embedding layer may have its own embedding dimension (and as suggested above, you may have more than one LSTM layer for processing numerical and categorical features separately):

from keras.layers import Input, Embedding, LSTM, TimeDistributed, Reshape, concatenate
from keras.models import Model

num_cats = 3 # number of categorical features
n_steps = 100 # number of timesteps in each sample
n_numerical_feats = 10 # number of numerical features in each sample
cat_size = [1000, 500, 100] # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100] # embedding dimension for each categorical feature

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,1), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
cat_merged = Reshape((n_steps, -1))(cat_merged)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)
model.summary()

Here is the model summary:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
cat1_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat2_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat3_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]                 
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]         
                                                                 time_distributed_2[0][0]         
                                                                 time_distributed_3[0][0]         
__________________________________________________________________________________________________
numeric_input (InputLayer)      (None, 100, 10)      0                                            
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]              
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]              
                                                                 reshape_1[0][0]                  
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]              
==================================================================================================
Total params: 125,160
Trainable params: 125,160
Non-trainable params: 0
__________________________________________________________________________________________________

Yet there is another solution which you can try: just have one embedding layer for all the categorical features. It involves some preprocessing though: you need to re-index all the categories to make them distinct from each other. For example, the categories in first categorical feature would be numbered from 1 to size_first_cat and then the categories in the second categorical feature would be numbered from size_first_cat + 1 to size_first_cat + size_second_cat and so on. However, in this solution all the categorical features would have the same embedding dimension since we are using only one embedding layer.
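The re-indexing step can be sketched in plain NumPy (cat_size reuses the sizes from the code above; the sample codes in X_cat are made up for illustration):

```python
import numpy as np

cat_size = [1000, 500, 100]               # vocabulary size of each categorical feature
offsets = np.cumsum([0] + cat_size[:-1])  # starting index of each feature: [0, 1000, 1500]

# X_cat: integer codes of shape (batch, n_steps, num_cats),
# each feature independently indexed from 0 (e.g. via pd.factorize).
X_cat = np.array([[[5, 3, 7],
                   [2, 0, 1]]])
X_cat_shifted = X_cat + offsets  # feature i now occupies [offsets[i], offsets[i] + cat_size[i])
print(X_cat_shifted[0, 0])       # first timestep: [5, 1003, 1507]
```

After shifting, a single shared layer of the form Embedding(sum(cat_size), embd_dim) can consume all the categorical codes at once, since no two features share an index.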

Update: Now that I think about it, you can also reshape the categorical features in data preprocessing stage or even in the model to get rid of TimeDistributed layers and the Reshape layer (and this may increase the training speed as well):

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)

As for fitting the model, you need to feed each input layer separately with its own corresponding numpy array, for example:

X_tr_numerical = X_train[:,:,:n_numerical_feats]

# extract categorical features: you can use a for loop to do this as well.
# note that we reshape categorical features to make them consistent with the updated solution
X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps) 
X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)

# don't forget to compile the model ...

# fit the model
model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)

# or you can use input layer names instead
model.fit({'numeric_input': X_tr_numerical,
           'cat1_input': X_tr_cat1,
           'cat2_input': X_tr_cat2,
           'cat3_input': X_tr_cat3}, y_train, ...)

If you would like to use fit_generator() there is no difference:

# if you are using a generator
def my_generator(...):

    # prep the data ...

    yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y

    # or use the names
    yield {'numeric_input': batch_tr_numerical,
           'cat1_input': batch_tr_cat1,
           'cat2_input': batch_tr_cat2,
           'cat3_input': batch_tr_cat3}, batch_tr_y

model.fit_generator(my_generator(...), ...)

# or if you are subclassing the Sequence class (from keras.utils import Sequence)
class MySequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        # initialize the data

    def __getitem__(self, idx):
        # fetch data for the given batch index (i.e. idx)

        # same as the generator above but use `return` instead of `yield`

model.fit_generator(MySequence(...), ...)
