Should an index be optimised after incremental indexes in Lucene?

Question

We run full re-indexes every 7 days (i.e. creating the index from scratch) on our Lucene index and incremental indexes every 2 hours or so. Our index has around 700,000 documents and a full index takes around 17 hours (which isn't a problem).

When we do incremental indexes, we only index content that has changed in the past two hours, so it takes much less time - around half an hour. However, we've noticed that a lot of this time (maybe 10 minutes) is spent running the IndexWriter.optimize() method.

The LuceneFAQ mentions that:

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

...but this doesn't give any definition of what "frequently" means. Optimizing is CPU-intensive and VERY IO-intensive, so we'd rather not do it if we can get away with it. How big a hit do queries take on an un-optimized index (I'm thinking especially of query performance right after a full re-index compared to after 20 incremental indexes where, say, 50,000 documents have changed)? Should we be optimising after every incremental index, or is the query-performance gain not worth the cost?

Answer

Mat, since you seem to have a good idea of how long your current process takes, I suggest that you remove the optimize() and measure the impact.

Do many of the documents change in those 2 hour windows? If only a small fraction (50,000/700,000 is about 7%) are incrementally re-indexed, then I don't think you are getting much value out of an optimize().

Some ideas:

  • Don't do an incremental optimize() at all. In my experience, you won't see a huge query improvement from it anyway.
  • Do the optimize() daily instead of 2-hourly.
  • Do the optimize() during low-volume times (which is what the javadoc says).
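
The second and third ideas can be combined into a small gate that each incremental run consults before optimizing. The following is only a sketch under stated assumptions: the class name OptimizePolicy, the 20-hour minimum interval, and the 01:00-05:00 low-traffic window are all hypothetical and would need to be tuned to the real traffic pattern. It uses only java.time and does not touch Lucene itself; the caller would invoke IndexWriter.optimize() (forceMerge in newer Lucene releases) when the gate returns true.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Decides whether an incremental indexing run should also run the expensive optimize step. */
public class OptimizePolicy {
    private final Duration minInterval;  // minimum time between two optimizes (assumed value, e.g. 20h)
    private final LocalTime windowStart; // start of the assumed low-traffic window (inclusive)
    private final LocalTime windowEnd;   // end of the assumed low-traffic window (exclusive)

    public OptimizePolicy(Duration minInterval, LocalTime windowStart, LocalTime windowEnd) {
        this.minInterval = minInterval;
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
    }

    /**
     * @param now          time of the current incremental run
     * @param lastOptimize time of the last optimize, or null if none has run yet
     */
    public boolean shouldOptimize(LocalDateTime now, LocalDateTime lastOptimize) {
        boolean due = lastOptimize == null
                || Duration.between(lastOptimize, now).compareTo(minInterval) >= 0;
        return due && inWindow(now.toLocalTime());
    }

    private boolean inWindow(LocalTime t) {
        if (windowStart.isBefore(windowEnd)) {
            return !t.isBefore(windowStart) && t.isBefore(windowEnd);
        }
        // Window wraps past midnight (e.g. 23:00-04:00); equal bounds mean "any time".
        return !t.isBefore(windowStart) || t.isBefore(windowEnd);
    }

    public static void main(String[] args) {
        OptimizePolicy policy =
                new OptimizePolicy(Duration.ofHours(20), LocalTime.of(1, 0), LocalTime.of(5, 0));
        LocalDateTime lastOptimize = LocalDateTime.now().minusHours(26); // hypothetical stored timestamp
        if (policy.shouldOptimize(LocalDateTime.now(), lastOptimize)) {
            System.out.println("optimize on this run");
            // writer.optimize() would go here, followed by persisting the new timestamp.
        } else {
            System.out.println("skip optimize on this run");
        }
    }
}
```

With runs every 2 hours and roughly 10 minutes per optimize, optimizing on every run costs about 12 × 10 = 120 minutes a day; a gate like this would bring that down to about 10 minutes a day. The interval is set a little under 24 hours so that small variations in run start times do not push the optimize to the following night.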

And make sure you take measurements. These kinds of changes can be a shot in the dark without them.
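
One way to take those measurements is to time a fixed set of representative queries against the index before and after dropping the optimize(), and compare percentiles rather than single runs. The harness below is a minimal sketch, not a proper benchmark: QueryTimer is a hypothetical name, the Runnable stands in for whatever search call the application actually makes, and the warm-up and run counts are arbitrary.

```java
import java.util.Arrays;

/** Times repeated executions of a query and reports latency percentiles. */
public class QueryTimer {

    /** Runs the query `warmup` times untimed, then `runs` times timed; returns ascending latencies in ns. */
    public static long[] time(Runnable query, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            query.run(); // let caches and the JIT settle before measuring
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    /** Nearest-rank percentile of an ascending-sorted, non-empty array; p in (0, 100]. */
    public static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Placeholder workload; in practice this would wrap a real search against the index.
        Runnable query = () -> {
            double sink = 0;
            for (int i = 1; i < 100_000; i++) {
                sink += Math.sqrt(i);
            }
            if (sink < 0) System.out.println(sink); // keeps the loop from being optimized away
        };
        long[] latencies = time(query, 20, 200);
        System.out.printf("median=%dus p95=%dus%n",
                percentile(latencies, 50) / 1_000, percentile(latencies, 95) / 1_000);
    }
}
```

Running the same query set right after a full re-index, and again after a day of incremental updates without optimize(), gives the comparison the question asks about; the time spent in the incremental run itself is the other number worth recording.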
