Java - MongoDB + Solr performance
Question
I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial answers, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text search on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is how I should configure/execute everything so that it performs well. Right now, here's what I do (and I know it's not optimal):
1- When inserting an object in MongoDB, I then add it to Solr:
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then update the Solr index with all of the object's properties plus the new ones, doing something like:
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, I first query Solr, then go through each document of the returned SolrDocumentList and:
- get the document ID
- fetch the object with the same ID from MongoDB, so that I can retrieve its properties from there
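The two-step lookup above can be sketched as follows. To keep the sketch self-contained, the Solr and MongoDB clients are replaced with an in-memory map, and all class and method names are illustrative, not part of any API; the point is to collect the IDs from the Solr results and fetch the matching documents from MongoDB in one batched lookup (with the real driver, a single `$in` query on `_id`) instead of one round-trip per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SolrMongoLookup {

    // Stand-in for the MongoDB collection: id -> full document.
    static Map<String, Map<String, Object>> mongo = new HashMap<>();

    // Step 1 (simulated): Solr returns only the IDs of the matching documents.
    static List<String> searchSolr(String query) {
        return Arrays.asList("doc1", "doc3");
    }

    // Step 2: fetch all matching documents from MongoDB at once.
    // With the real driver this would be ONE $in query on _id,
    // not one findOne() call per ID.
    static List<Map<String, Object>> fetchFromMongo(List<String> ids) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (String id : ids) {
            Map<String, Object> doc = mongo.get(id);
            if (doc != null) {
                results.add(doc);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        mongo.put("doc1", Map.of("id", "doc1", "title", "first"));
        mongo.put("doc2", Map.of("id", "doc2", "title", "second"));
        mongo.put("doc3", Map.of("id", "doc3", "title", "third"));

        List<String> ids = searchSolr("title:first OR title:third");
        List<Map<String, Object>> docs = fetchFromMongo(ids);
        System.out.println(docs.size()); // 2
    }
}
```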
4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java.
So, does anybody have suggestions on how to do each of the scenarios described here more efficiently? For instance, a process that won't take an hour to rebuild the index when there are a lot of documents in Solr and documents are added one at a time? My requirement here is that users may want to add one document at a time, many times, and I'd like them to be able to retrieve it right after.
Answer
Your approach is actually good. Some popular frameworks, such as Compass, perform what you describe at a lower level in order to automatically mirror index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure both Solr and Mongo stay in sync. This probably takes less time than you might think, depending on the number of documents, the number of fields, the number of tokens per field, and the performance of the analyzers: I often build an index of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers. Just ensure your RAM buffer is not too small, and do not commit/optimize until all documents have been added.
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you can change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
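As a sketch, the merge factor is set in the index settings section of solrconfig.xml (Solr 3.x-style layout); the values below are only starting points to tune, and ramBufferSizeMB is the RAM buffer mentioned above:

```xml
<!-- solrconfig.xml: higher mergeFactor = faster indexing,
     lower mergeFactor = faster searches; 10 is a reasonable start -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
</indexDefaults>
```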
You seem to be concerned about the index build time. However, since Lucene's index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). The warm-up time will increase, though, so you should ensure that:
- there are typical, but not too complex, queries in the firstSearcher and newSearcher parameters of your solrconfig.xml config file (especially sorted queries, in order to load the field caches),
- useColdSearcher is set to:
  - false in order to have good search performance, or
  - true if you want changes performed to the index to be taken into account faster, at the price of slower searches.
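A sketch of the corresponding solrconfig.xml fragment; the warm-up query and the `price` sort field are only placeholders, to be replaced with queries and sort fields typical of your application:

```xml
<query>
  <!-- run a typical sorted query whenever a searcher is opened,
       so the field caches are loaded before real users hit it -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">price desc</str></lst>
    </arr>
  </listener>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">price desc</str></lst>
    </arr>
  </listener>
  <!-- false = block requests until the first searcher is warmed -->
  <useColdSearcher>false</useColdSearcher>
</query>
```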
Moreover, if it is acceptable for you that the data becomes searchable only X milliseconds after it has been written to MongoDB, you can use the commitWithin feature of the UpdateHandler. This way, Solr will have to commit less often.
For more information about Solr performance factors, see http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
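A minimal sketch of the delete step. The live SolrJ calls (same 3.x-era SolrServer API used earlier in this post) need a running Solr instance, so they are shown in comments; the escaping helper below is a hypothetical utility for embedding raw values in a delete-by-query string (SolrJ ships ClientUtils.escapeQueryChars for the same purpose).

```java
public class SolrDelete {

    // Escape Lucene/Solr query special characters so a raw ID or value
    // can be embedded safely in a delete-by-query string.
    // (Hypothetical helper; SolrJ's ClientUtils.escapeQueryChars is equivalent.)
    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/ ".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String documentId = "42:abc";
        String deleteQuery = "id:" + escapeQueryChars(documentId);
        System.out.println(deleteQuery); // id:42\:abc

        // With a live Solr server (SolrJ SolrServer API):
        // server.deleteById(documentId);     // delete one document by its unique key
        // server.deleteByQuery(deleteQuery); // or delete everything matching a query
        // server.commit();                   // or rely on commitWithin instead
    }
}
```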