Got Java heap size error when trying to cluster 15980 documents via carrot2workbench


Question

My environment: 8GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, Carrot2 Workbench 3.10.0

My Solr index: 15980 documents

My problem: cluster all documents with the k-means algorithm

When I run the match-all query in the Carrot2 Workbench (query: *:*), I always get a Java heap size error once more than ~1000 results are clustered. I started Solr with -Xms256m -Xmx6g, but it still occurs.
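(Note: -Xms/-Xmx passed to Solr only size Solr's heap; the Workbench performs the clustering in its own JVM, whose heap is configured separately. A minimal sketch, assuming the Workbench uses an Eclipse-style launcher config such as a hypothetical carrot2-workbench.ini next to the executable; everything after -vmargs is handed to the Workbench JVM:

    -vmargs
    -Xms256m
    -Xmx4g

If the error persists even with a larger Workbench heap, the in-memory scalability constraint discussed in the answer below is the more likely cause.)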

Is it really a heap size problem, or could it be something else?

Answer

Your suspicion is correct: it is a heap size problem, or more precisely, a scalability constraint. Straight from the Carrot2 FAQ: http://project.carrot2.org/faq.html#scalability

How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand of documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

A developer also posted about this here: https://stackoverflow.com/a/28991477

While the developers recommend Mahout, and that is probably the way to go since you would not be bound by carrot2's in-memory clustering constraint, there are other possibilities:

  1. If you really like carrot2 but do not necessarily need k-means, you could take a look at the commercial Lingo3G. Based on the "Time of clustering 100000 snippets [s]" field and the (***) remark on http://carrotsearch.com/lingo3g-comparison, it should be able to tackle more documents. Also check their FAQ entry "What is the maximum number of documents Lingo3G can cluster?" at http://carrotsearch.com/lingo3g-faq

  2. Try to minimize the size of the labels on which k-means performs the clustering. Instead of clustering over the full document content, cluster on the abstract/summary, or extract important keywords and cluster on those (see the sketch after this list).
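As an illustration of the second option, here is a minimal sketch with the Carrot2 3.x Java API, clustering titles and abstracts only rather than full bodies. The two documents are hypothetical placeholders; in practice they would be built from fields fetched from your Solr index:

    import java.util.ArrayList;
    import java.util.List;

    import org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm;
    import org.carrot2.core.Cluster;
    import org.carrot2.core.Controller;
    import org.carrot2.core.ControllerFactory;
    import org.carrot2.core.Document;
    import org.carrot2.core.ProcessingResult;

    public class AbstractClusteringSketch {
        public static void main(String[] args) {
            // Hypothetical input: use title + abstract from Solr instead of
            // the full document body to keep the in-memory footprint small.
            List<Document> documents = new ArrayList<Document>();
            documents.add(new Document("Doc 1 title", "Short abstract of document 1"));
            documents.add(new Document("Doc 2 title", "Short abstract of document 2"));

            // Simple single-threaded controller; clustering runs in this JVM's heap.
            Controller controller = ControllerFactory.createSimple();

            // Bisecting k-means is the k-means variant shipped with Carrot2 3.x.
            // No query context is passed (null).
            ProcessingResult result = controller.process(documents, null,
                    BisectingKMeansClusteringAlgorithm.class);

            for (Cluster cluster : result.getClusters()) {
                System.out.println(cluster.getLabel() + " (" + cluster.size() + " docs)");
            }
        }
    }

Since the whole vector space is held in memory, shrinking each document to a short abstract or a handful of keywords directly reduces the heap needed per document, which is what makes clustering all 15980 documents more feasible.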
