GridSearch for doc2vec model built using gensim


Question

I am trying to find the best hyperparameters for my doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents but has no labels, i.e. I only have 'X' and no 'y'.

I found some questions here related to what I am trying to do, but all of the proposed solutions are for supervised models and none are for unsupervised models like mine.

Here is the code where I train my doc2vec model:

from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import to_unicode

def train_doc2vec(
    self,
    X: List[List[str]],
    epochs: int = 10,
    learning_rate: float = 0.0002) -> Doc2Vec:

    tagged_documents = list()

    # wrap each tokenized document in a TaggedDocument tagged with its index
    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)

    # hyperparameters are taken from self.params_doc2vec
    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)

    for epoch in range(epochs):
        model.train(tagged_documents,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= learning_rate
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    return model

I need suggestions on how to find the best hyperparameters for my model using GridSearch, or suggestions for some other technique. Help is much appreciated.

Recommended answer

Regardless of the correctness of the code, I will try to answer your question on how to tune the hyper-parameters. You start by defining the sets of hyper-parameters that make up your grid search. For each set of hyper-parameters

Hset1 = (par1Value1, par2Value1, ..., par3Value1)

you train your model on the training set and use an independent validation set to measure its accuracy (or whatever metric you wish to use). You store this value (e.g. A_Hset1). Once you have done this for every set of hyper-parameters in the grid, you will have a set of measures

(A_Hset1, A_Hset2, A_Hset3, ..., A_HsetK).

Each of those measures tells you how good your model is for the corresponding set of hyper-parameters, so your optimal set of hyper-parameters is

H_setOptimal = HsetX | A_HsetX = max(A_Hset1, A_Hset2, A_Hset3, ..., A_HsetK)

To make the comparison fair, you should always train the model on the same data and always use the same validation set.
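One way to keep the data fixed across all runs is to split the corpus once with a fixed random seed and reuse that split for every set of hyper-parameters. The following is a minimal sketch, not part of the original answer; the helper name `split_corpus`, the 20% validation fraction, and the seed value are assumptions chosen for illustration.

```python
import random

def split_corpus(docs, val_fraction=0.2, seed=42):
    """Shuffle document indices once with a fixed seed and split into (train, validation)."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)  # same seed gives the same split every time
    n_val = max(1, int(len(docs) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [d for i, d in enumerate(docs) if i not in val_idx]
    val = [d for i, d in enumerate(docs) if i in val_idx]
    return train, val

# Toy corpus of tokenized documents (illustrative data only)
docs = [["doc", str(i)] for i in range(10)]
train_docs, val_docs = split_corpus(docs)
```

Calling `split_corpus(docs)` again returns the same two lists, so every hyper-parameter set is trained and evaluated on identical data.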

I'm not an advanced Python user, so you can probably find better suggestions elsewhere, but what I would do is create a list of dictionaries, where each dictionary contains one set of hyper-parameters that you want to test:

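# each dictionary may hold further parameters; "res" will store the result
grid_search = [
    {"par1": "val1", "par2": "val1", "par3": "val1", "res": None},
    {"par1": "val2", "par2": "val1", "par3": "val1", "res": None},
    {"par1": "val3", "par2": "val1", "par3": "val1", "res": None},
    # ... one dictionary per remaining combination ...
    {"par1": "valn", "par2": "valn", "par3": "valn", "res": None},
]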
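Writing every combination by hand gets tedious, so the list can also be generated. The sketch below uses `itertools.product` from the standard library; the parameter names `vector_size`, `window`, and `min_count` and their candidate values are only illustrative assumptions, so substitute whichever Doc2Vec parameters you actually want to tune.

```python
from itertools import product

# Candidate values for each hyper-parameter (illustrative choices)
param_grid = {
    "vector_size": [50, 100],
    "window": [5, 10],
    "min_count": [1, 2],
}

names = list(param_grid)

# One dictionary per combination, each with an empty "res" slot for the result
grid_search = [
    {**dict(zip(names, values)), "res": None}
    for values in product(*(param_grid[n] for n in names))
]
```

With two candidate values for each of three parameters, this yields eight dictionaries.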

That way you can store each result in the "res" field of the corresponding dictionary and track the performance of each set of parameters.

for params in grid_search:
    # insert here your training and accuracy evaluation using the
    # parameters in params

    # placeholder: the metric value you computed for this set of parameters
    params["res"] = the_accuracy_for_hyperpar_in_params
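The loop above can be completed into a runnable sketch once a scoring function is available. Here `score_params` is a hypothetical stand-in that returns a made-up number so the example runs without gensim; in practice it would train the model with the given parameters and return your chosen validation metric. The parameter names and values are again assumptions.

```python
def score_params(params):
    """Hypothetical stand-in for 'train the model and measure it on the validation set'."""
    # Toy score that peaks at vector_size=100, window=5 (illustrative only)
    return -abs(params["vector_size"] - 100) - abs(params["window"] - 5)

grid_search = [
    {"vector_size": 50, "window": 5, "res": None},
    {"vector_size": 100, "window": 5, "res": None},
    {"vector_size": 100, "window": 10, "res": None},
]

for params in grid_search:
    # Pass only the hyper-parameters, not the result slot, to the scorer
    hyper = {k: v for k, v in params.items() if k != "res"}
    params["res"] = score_params(hyper)

# The optimal set is the one with the highest stored measure
best = max(grid_search, key=lambda p: p["res"])
```

If your metric is an error rather than a score (lower is better), use `min` instead of `max`.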

Hope this helps.
