java - MongoDB + Solr performance


Problem description

I've been looking around a lot to see how to use MongoDB in combination with Solr. Some questions here have partial answers, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text search on some properties of those documents, so I guess Solr is the best way to do this.

What I want to know is: how should I configure/execute everything so that it performs well? Right now, here's what I do (and I know it's not optimal):

1- When inserting an object into MongoDB, I then add it to Solr:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
... // add the remaining fields
server.add(document);
server.commit(); // hard commit after every single insert -- expensive

2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then I update the Solr index with all of the object's properties (existing and new), and do something like:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// constructor arguments: (solrServerUrl, queueSize, threadCount)
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
... // re-add every field, not just the one that changed
update.add(document);
update.commit();

3- When querying, I first query Solr, and then, while going through the SolrDocumentList of results, for each document I:

  1. get the document's id
  2. fetch the object with the same id from MongoDB, so that I can retrieve its properties from there
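
In code, my query-then-fetch loop looks roughly like this (a sketch: the query string, database and collection names are placeholders, and getServer() is my own helper):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

SolrServer server = getServer();
SolrDocumentList results = server.query(new SolrQuery("text:example")).getResults();

DBCollection collection = new Mongo().getDB("mydb").getCollection("documents");
for (SolrDocument solrDocument : results) {
    // 1. get the document's id
    String id = (String) solrDocument.getFieldValue("id");
    // 2. fetch the object with the same id from MongoDB
    DBObject object = collection.findOne(new BasicDBObject("_id", id));
    // ... read the needed properties from object
}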

4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java.

So, does anybody have suggestions on how to do each of the scenarios described here more efficiently? For example, a process where rebuilding the index won't take an hour when there are a lot of documents in Solr and documents are added one at a time? My requirement here is that users may want to add one document at a time, many times, and I'd like them to be able to retrieve it right after.

Recommended answer

Your approach is actually good. Some popular frameworks such as Compass perform what you describe at a lower level in order to automatically mirror to the index the changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).

In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure both Solr and MongoDB stay in sync (probably not as long as you might think, depending on the number of documents, the number of fields, the number of tokens per field and the performance of the analyzers: I often build indexes of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers; just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added).
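
As a rough sketch of what such a periodic re-index can look like (assuming the MongoDB Java driver's DBCursor API; toSolrInputDocument() is a hypothetical mapping helper you would write yourself):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
DBCursor cursor = collection.find();
while (cursor.hasNext()) {
    DBObject object = cursor.next();
    batch.add(toSolrInputDocument(object)); // hypothetical Mongo-to-Solr mapping
    if (batch.size() == 1000) { // send documents in batches rather than one by one
        server.add(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    server.add(batch);
}
server.commit(); // commit once at the very end, as advised above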

Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
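
For example, a minimal solrconfig.xml sketch (Solr 1.4/3.x-style syntax; the ramBufferSizeMB value is only an illustration of the RAM-buffer advice above):

<indexDefaults>
  <!-- high values favour write throughput, low values favour read performance -->
  <mergeFactor>10</mergeFactor>
  <!-- a generous RAM buffer helps bulk indexing -->
  <ramBufferSizeMB>128</ramBufferSizeMB>
</indexDefaults>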

You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). The warm-up time will increase, though, so you should ensure that:

  • there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters of your solrconfig.xml configuration file (see the sketch after this list), and
  • useColdSearcher is set to
    • false in order to get good search performance, or
    • true if you want changes performed to the index to be taken into account faster, at the price of slower searches.
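
For example, a minimal solrconfig.xml sketch, where the warming query and sort are placeholders to be replaced with queries typical of your application:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- placeholder warming query; include your typical sorts to load the field caches -->
    <lst><str name="q">text:example</str><str name="sort">date desc</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">text:example</str><str name="sort">date desc</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>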

Moreover, if it is acceptable for you that the data becomes searchable only a few X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often.
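
With SolrJ you can set this on each update request; for example (a sketch, with an arbitrary 10-second window):

import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

UpdateRequest request = new UpdateRequest();
request.add(document);
request.setCommitWithin(10000); // make the document searchable within at most 10 s
request.process(server); // no explicit commit() needed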

For more information about Solr performance factors, see http://wiki.apache.org/solr/SolrPerformanceFactors

To delete documents, you can either delete by document ID (as defined in schema.xml) or delete by query: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
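
In SolrJ, both are one-liners, for example (a sketch; the delete-by-query string is made up, and deletes, like adds, only become visible after a commit):

server.deleteById(documentId); // delete a single document by its unique key
server.deleteByQuery("category:obsolete"); // or delete everything matching a query
server.commit();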
