What is the minimum dataset size needed for good performance with doc2vec?

Question

How does doc2vec perform when trained on different-sized datasets? There is no mention of dataset size in the original paper, so I am wondering what minimum size is required to get good performance out of doc2vec.

Answer

A bunch of things have been called 'doc2vec', but it seems to most-often refer to the 'Paragraph Vector' technique from Le and Mikolov.

The original 'Paragraph Vector' paper describes evaluating it on three datasets:

  • "Stanford Sentiment Treebank": 11,825 sentences of movie reviews (further split into 239,232 fragment phrases, each a few words long)
  • "IMDB dataset": 100,000 movie reviews (typically a few hundred words each)
  • Search-results "snippet" paragraphs: 10,000,000 paragraphs, collected from the top-10 Google search results for each of the 1,000,000 most-common queries

The first two are publicly available, so you can also review their total sizes in words, typical document sizes, and vocabularies. (Note, though, that no one has been able to fully reproduce that paper's sentiment-classification results on either of those first two datasets, implying some missing info or error in their reporting. It is possible to get close on the IMDB dataset.)

A follow-up paper applied the algorithm to discovering topical relationships in the following datasets:

  • Wikipedia: 4,490,000 article texts
  • Arxiv: 886,000 academic-paper texts extracted from PDFs

So the corpora used in those two early papers ranged from tens of thousands to millions of documents, and document sizes ranged from few-word phrases to thousands-of-word articles. (But those works did not necessarily mix wildly differently-sized documents in the same training.)

In general, word2vec/paragraph-vector techniques benefit from a lot of data and variety of word-contexts. I wouldn't expect good results without at least tens-of-thousands of documents. Documents longer than a few words each work much better. Results may be harder to interpret if wildly-different-in-size or -kind documents are mixed in the same training – such as mixing tweets and books.
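
As a concrete illustration of the kind of setup being discussed, here is a minimal training sketch in Python using the gensim library; gensim is not mentioned in the answer, and the corpus and hyperparameter values below are illustrative assumptions, not recommendations from the paper:

```python
# A minimal Doc2Vec training sketch (gensim is an assumed implementation
# choice; the answer above does not prescribe any particular library).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus of pre-tokenized documents. In practice you would
# want tens of thousands of documents or more, per the advice above; the
# two toy documents here exist only to keep the example runnable.
raw_docs = [
    ["this", "movie", "was", "surprisingly", "good"],
    ["terrible", "plot", "but", "the", "acting", "was", "good"],
]
tagged = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(raw_docs)]

# Illustrative hyperparameters. min_count=1 only lets the toy corpus
# survive vocabulary pruning; 5 or higher is more typical at real scale.
model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=20, workers=4)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new, unseen document.
new_vec = model.infer_vector(["the", "movie", "was", "terrible"])
```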

But you really have to evaluate it with your corpus and goals, because what works with some data, for some purposes, may not be generalizable to very-different projects.
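
One cheap sanity check, offered here as an assumed way to "evaluate it with your corpus and goals" rather than anything this answer prescribes, is to re-infer vectors for documents the model was trained on and verify that each document's nearest stored doc-vector is itself. Continuing the sketch above:

```python
# Self-similarity sanity check, continuing the model above: re-infer each
# training document and check that its nearest stored doc-vector is itself.
# A high hit rate suggests training was self-consistent; it is a smoke
# test, not a substitute for evaluating on your real downstream task.
hits = 0
for i, doc in enumerate(tagged):
    inferred = model.infer_vector(doc.words)
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    if top_tag == i:
        hits += 1
print(f"{hits}/{len(tagged)} documents are their own nearest neighbor")
```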
