Gensim equivalent of training steps


Problem description


Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps?


The TensorFlow script includes this section.

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including
        # it in the list of returned values for session.run()).
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()


In the TensorFlow example, if I perform t-SNE on the embeddings and plot them with matplotlib, the plot looks more reasonable to me when the number of steps is high. I am using a small corpus of 1,200 emails. One way it looks more reasonable is that numbers are clustered together. I would like to attain the same apparent level of quality using gensim.
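For reference, the projection step described above can be sketched as follows. This is a minimal example, with a random matrix standing in for the real trained embeddings, and it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a trained embedding matrix: 200 "words" x 50 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))

# Project to 2-D; each row of coords can then be scattered with matplotlib.
coords = TSNE(n_components=2, perplexity=30, init="random",
              random_state=0).fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```

With a real model, the embedding matrix would come from the trained vectors rather than a random generator.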

Answer

Yes, the Word2Vec class constructor has an iter parameter:


iter = number of iterations (epochs) over the corpus. Default is 5.


Also, if you call the Word2Vec.train() method directly, you can pass an epochs argument that has the same meaning.


The number of actual training steps is derived from the number of epochs, but it also depends on other parameters such as text size, window size, and batch size. If you simply want to improve the quality of the embedding vectors, increasing iter is the right approach.

