Carrot2 Workbench not able to process large data sets
Problem description
I want to cluster my data set using the Carrot2 Workbench. My input is an XML file with 65536 documents, and I am using the Lingo clustering algorithm.
But when I start the process, the Workbench returns a result within a few seconds, with all the documents placed in the "Other Topics" cluster.
I have tried clustering smaller data sets, and with those I do get proper results.
Recommended answer
The Carrot2 Lingo algorithm was designed for small data sets, up to a thousand or so documents. For larger data sets, you may want to try STC, which scales better.
Regardless of the algorithm, Carrot2 processes all data in memory, so it will not scale to millions of documents. For data sets of that size, you may want to look at Apache Mahout, for example.
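As a rough illustration of the rule of thumb above, here is a minimal Python sketch that counts the documents in an input file and suggests which route to try. It is not part of Carrot2. The `<document>` element name is an assumption about the input XML layout, the helper names are hypothetical, and the two thresholds are only loose readings of "a thousand or so" and "millions", not official limits.

```python
import io
import xml.etree.ElementTree as ET

# Assumed thresholds, read off the answer's wording; not official Carrot2 limits.
LINGO_MAX_DOCS = 1_000          # "up to a thousand or so documents"
IN_MEMORY_MAX_DOCS = 1_000_000  # "will not scale to millions of documents"


def count_documents(source, tag="document"):
    """Count elements named `tag`, discarding each one's content once counted."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the text and children of the counted element
    return count


def suggest_approach(n_docs):
    """Map a document count to the option the answer recommends for that size."""
    if n_docs <= LINGO_MAX_DOCS:
        return "Lingo"
    if n_docs < IN_MEMORY_MAX_DOCS:
        return "STC"
    return "Apache Mahout"


# Tiny in-memory stand-in for an input file (layout is an assumption).
sample = io.BytesIO(
    b"<searchresult>"
    b"<document><title>A</title><snippet>first</snippet></document>"
    b"<document><title>B</title><snippet>second</snippet></document>"
    b"</searchresult>"
)
print(count_documents(sample))   # 2
print(suggest_approach(65536))   # STC
```

For the 65536-document file from the question, this lands on STC, which matches the suggestion in the answer. `count_documents` also accepts a file path in place of the in-memory sample.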