java - MongoDB + Solr performances

Question

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial answers, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text search on some properties of those documents, so I guess Solr is the best way to do this.

What I want to know is how I should configure/execute everything so that it has good performance. Right now, here's what I do (and I know it's not optimal):

1- When inserting an object in MongoDB, I then add it to Solr:

SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();

2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then update the Solr index with all of the object's properties plus the new ones, and do something like:

StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();

3- When querying, first I query Solr, and then for each document in the returned SolrDocumentList I (see the sketch after this list):

  1. Get the document id
  2. Get the object with the same id from MongoDB so I can retrieve the properties from there
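
For reference, here is a minimal sketch of that query flow with SolrJ and the MongoDB Java driver. It assumes a SolrServer named server and a DBCollection named collection are already set up, and that the Solr id field matches the MongoDB _id; the query string, row count and field names are placeholders.

SolrQuery query = new SolrQuery("text:something");
query.setFields("id");   // only the id is needed from Solr
query.setRows(50);

QueryResponse response = server.query(query);
SolrDocumentList results = response.getResults();

// Collect the ids returned by Solr...
List<Object> ids = new ArrayList<Object>();
for (SolrDocument doc : results) {
    ids.add(doc.getFieldValue("id"));
}

// ...and fetch the full objects from MongoDB in a single $in query
DBCursor cursor = collection.find(new BasicDBObject("_id", new BasicDBObject("$in", ids)));
for (DBObject obj : cursor) {
    // read the properties you need from obj
}

Fetching all matching objects with one $in query avoids doing one round trip to MongoDB per Solr hit.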

4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java.
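
In case it helps, a minimal sketch of what a delete could look like, assuming the same server and collection handles as above and that documentId is the id shared between MongoDB and Solr:

collection.remove(new BasicDBObject("_id", documentId));   // remove the object from MongoDB
server.deleteById(String.valueOf(documentId));             // remove the document from the Solr index
server.commit();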

So, does anybody have suggestions on how to do this more efficiently for each of the scenarios described here? For instance, is there a process that won't take an hour to rebuild the index when Solr already holds a lot of documents and documents are added one at a time? My requirement here is that users may want to add one document at a time, many times, and I'd like them to be able to retrieve it right after.

Answer

Your approach is actually good. Some popular frameworks such as Compass perform what you describe at a lower level in order to automatically mirror index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).

In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure both Solr and Mongo stay in sync. This probably takes less time than you might think, depending on the number of documents, the number of fields, the number of tokens per field, and the performance of the analyzers: I often build an index of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers; just make sure your RAM buffer is not too small and do not commit/optimize until all documents have been added.
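
As a rough illustration of that re-indexing pass, here is a minimal sketch, assuming a MongoDB DBCollection named collection, a Solr URL in url, and placeholder field names; the key point is to buffer adds and commit only once at the end:

// buffered, multi-threaded writer: queue size and thread count are just example values
StreamingUpdateSolrServer bulk = new StreamingUpdateSolrServer(url, 1000, 4);

DBCursor cursor = collection.find();
for (DBObject obj : cursor) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", obj.get("_id"));
    // ...copy the searchable properties from obj into doc...
    bulk.add(doc);      // no commit per document
}

bulk.commit();          // a single commit once everything has been added
// bulk.optimize();     // optional and expensive; only if read performance requires it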

Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).

You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that
