Efficiency aspect of delta import in Solr


Problem description

I have about 2,100,000 rows of data. A full-import takes about 2 minutes. For any updates to the table I use delta-import to index the changes, and the delta-import takes 6 minutes.

From an efficiency standpoint, it looks better to run a full import than a delta import. So what is the need for delta import? Is there a better way to use delta import to increase its efficiency?

I followed the steps in the documentation.

data-config.xml

<dataConfig>
<dataSource type="JdbcDataSource" driver="com.dbschema.CassandraJdbcDriver" url="jdbc:cassandra://127.0.0.1:9042/test" autoCommit="true" rowLimit = '-1' batchSize="-1"/>
<document name="content">
    <entity name="test" query="SELECT * from person" deltaImportQuery="select * from person where seq=${dataimporter.delta.seq}" deltaQuery="select seq from person where last_modified &gt; '${dataimporter.last_index_time}' ALLOW FILTERING" autoCommit="true">
        <field column="seq" name="id" />
        <field column="last" name="last_s" />
        <field column="first" name="first_s" />
        <field column="city" name="city_s" />
        <field column="zip" name="zip_s" />
        <field column="street" name="street_s" />
        <field column="age" name="age_s" />
        <field column="state" name="state_s" />
        <field column="dollar" name="dollar_s" />
        <field column="pick" name="pick_s" />
    </entity>
</document>
</dataConfig>

Recommended answer

The usual way of setting up delta indexing (as you did) runs two queries instead of one: deltaQuery first collects the keys of the changed rows, and deltaImportQuery is then executed once for each of those keys. So in some cases it might not be optimal.

I prefer to set up delta like this: there is a single query to maintain, it is cleaner, and the delta runs in a single query. You should try it; it might improve things. The downside is deletes: you either do some soft-deleting or you still need the usual delta configuration for that (I favour the first).
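The link behind "like this" did not survive on this page. One common reading is the delta-via-full-import pattern, where the time filter sits directly in the entity's query and the import is started as a full-import with clean=false so existing documents are kept. A minimal sketch under that assumption, reusing the data source and field mappings from the question (whether the Cassandra JDBC driver handles this well is untested here):

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.dbschema.CassandraJdbcDriver"
              url="jdbc:cassandra://127.0.0.1:9042/test"
              autoCommit="true" rowLimit="-1" batchSize="-1"/>
  <document name="content">
    <!-- Assumption: a single query replaces deltaQuery + deltaImportQuery.
         Only rows changed since the last recorded import are fetched. -->
    <entity name="test"
            query="SELECT * FROM person WHERE last_modified &gt; '${dataimporter.last_index_time}' ALLOW FILTERING">
      <field column="seq" name="id" />
      <field column="last" name="last_s" />
      <field column="first" name="first_s" />
      <field column="city" name="city_s" />
      <field column="zip" name="zip_s" />
      <field column="street" name="street_s" />
      <field column="age" name="age_s" />
      <field column="state" name="state_s" />
      <field column="dollar" name="dollar_s" />
      <field column="pick" name="pick_s" />
    </entity>
  </document>
</dataConfig>
```

With this configuration the incremental run would be triggered with command=full-import&clean=false rather than command=delta-import. Note that this sketch only covers the incremental case: a complete rebuild would still need the original unfiltered query, and rows deleted from the table are not removed from the index, which is the soft-delete caveat mentioned above.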

Also, of course, make sure the last_modified column is properly indexed. I am not familiar with the Cassandra JDBC driver, so you should double-check that.

One last thing: if you are using DataStax Enterprise Edition, you can query it via Solr, provided you have configured it for that. In that case you could also try indexing off SolrEntityProcessor, and with a request-parameter trick you can do both full and delta indexing. I have used it successfully in the past.
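The answer does not spell out the request-parameter trick, so the following is only a guess at its shape: the query sent to the source Solr endpoint is taken from a request parameter, so the same entity serves both full and delta runs. The URL, the parameter name srcQuery, and the presence of a last_modified field in the source index are all assumptions made for illustration.

```xml
<dataConfig>
  <document>
    <!-- Assumption: url points at a Solr endpoint exposing the Cassandra data.
         srcQuery is a hypothetical request parameter supplied per import run. -->
    <entity name="person"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/source_core"
            query="${dataimporter.request.srcQuery}"
            rows="1000"
            fl="*" />
  </document>
</dataConfig>
```

Under these assumptions, a full run would pass command=full-import&clean=true&srcQuery=*:* and an incremental run something like command=full-import&clean=false&srcQuery=last_modified:[NOW-1HOUR TO *], with the values URL-encoded.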
