Got Java heap size error when trying to cluster 15980 documents via carrot2workbench


Question

My environment: 8GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, Carrot2 Workbench 3.10.0

My Solr index: 15980 documents

My problem: cluster all documents with the k-means algorithm

When I run the match-all query in the Carrot2 Workbench (query: *:*), I always get a Java heap size error once more than ~1000 results are clustered. I started Solr with -Xms256m -Xmx6g, but it still occurs.
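(Note: -Xms/-Xmx passed to Solr only size Solr's heap; the Workbench performs the clustering in its own JVM, whose heap is configured separately. A minimal sketch, assuming the Workbench uses an Eclipse-style launcher config such as a hypothetical carrot2-workbench.ini next to the executable; everything after -vmargs is handed to the Workbench JVM:

    -vmargs
    -Xms256m
    -Xmx4g

If the error persists even with a larger Workbench heap, the in-memory scalability constraint discussed in the answer below is the more likely cause.)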

Is it really a heap size problem, or could it be something else?

Answer

Your suspicion is correct: it is a heap size problem, or more precisely, a scalability constraint. Straight from the Carrot2 FAQ: http://project.carrot2.org/faq.html#scalability

How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand of documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

A developer also posted about this here: https://stackoverflow.com/a/28991477

While the developers recommend Mahout, and that is probably the way to go since you would not be bound by carrot2's in-memory clustering constraint, there are other possibilities:

  1. If you really like carrot2 but do not necessarily need k-means, you could take a look at the commercial Lingo3G. Based on the "Time of clustering 100000 snippets [s]" field and the (***) remark on http://carrotsearch.com/lingo3g-comparison, it should be able to tackle more documents. Also check their FAQ entry "What is the maximum number of documents Lingo3G can cluster?" at http://carrotsearch.com/lingo3g-faq

  2. Try to minimize the size of the labels on which k-means performs the clustering. Instead of clustering over the full document content, cluster on the abstract/summary, or extract important keywords and cluster on those (see the sketch after this list).
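As an illustration of the second option, here is a minimal sketch with the Carrot2 3.x Java API, clustering titles and abstracts only rather than full bodies. The two documents are hypothetical placeholders; in practice they would be built from fields fetched from your Solr index:

    import java.util.ArrayList;
    import java.util.List;

    import org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm;
    import org.carrot2.core.Cluster;
    import org.carrot2.core.Controller;
    import org.carrot2.core.ControllerFactory;
    import org.carrot2.core.Document;
    import org.carrot2.core.ProcessingResult;

    public class AbstractClusteringSketch {
        public static void main(String[] args) {
            // Hypothetical input: use title + abstract from Solr instead of
            // the full document body to keep the in-memory footprint small.
            List<Document> documents = new ArrayList<Document>();
            documents.add(new Document("Doc 1 title", "Short abstract of document 1"));
            documents.add(new Document("Doc 2 title", "Short abstract of document 2"));

            // Simple single-threaded controller; clustering runs in this JVM's heap.
            Controller controller = ControllerFactory.createSimple();

            // Bisecting k-means is the k-means variant shipped with Carrot2 3.x.
            // No query context is passed (null).
            ProcessingResult result = controller.process(documents, null,
                    BisectingKMeansClusteringAlgorithm.class);

            for (Cluster cluster : result.getClusters()) {
                System.out.println(cluster.getLabel() + " (" + cluster.size() + " docs)");
            }
        }
    }

Since the whole vector space is held in memory, shrinking each document to a short abstract or a handful of keywords directly reduces the heap needed per document, which is what makes clustering all 15980 documents more feasible.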
