Solr: FileListEntityProcessor is executing sub-entities multiple times


Question

I have configured a dih-import.xml as shown below. The FileListEntityProcessor walks through some folders and then, for each file, executes an XPath entity and a DB entity.

When I ran a full import for about 30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.

Google didn't turn up anything on this subject, so I am hoping you can help :)

Thanks in advance.

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource 
        name="cr-db"
        jndiName="xyz"
        type="JdbcDataSource" />
    <dataSource 
        name="cr-xml" 
        type="FileDataSource" 
        encoding="utf-8" />


    <document name="doc">
        <entity 
            dataSource="cr-xml" 
            name="f" 
            processor="FileListEntityProcessor" 
            baseDir="/path/to/xml" 
            filename="*.xml" 
            recursive="true" 
            rootEntity="true" 
            onError="skip">
            <entity
                name="xml-data" 
                dataSource="cr-xml" 
                processor="XPathEntityProcessor" 
                forEach="/root" 
                url="${f.fileAbsolutePath}" 
                transformer="DateFormatTransformer" 
                onError="skip">
                <field column="id" xpath="/root/id" /> 

                <field column="A" xpath="/root/a" />
            </entity>

            <entity 
                name="db-data" 
                dataSource="cr-db"
                query="
                    SELECT  
                        id, b
                    FROM 
                        a_table
                    WHERE 
                        id = '${f.file}'">
                <field column="B" name="b" />
            </entity>
        </entity>
    </document>
</dataConfig>


EDIT: I found the same problem reported via Google, but there is no answer there either: http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html

Another edit:

I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are no longer called 2n times (2, 4, 6, 8, ...) but only n times.

Recommended answer

If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching to CachedSqlEntityProcessor.
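As a rough sketch of what that switch might look like for the db-data entity from the question: the query drops its per-file WHERE clause so the table is read once, and each file is then matched against the in-memory cache. The cacheKey and cacheLookup attribute names are given here from memory of the DIH documentation for newer releases (older releases expressed the same join as where="id=f.file"), so treat them as assumptions and check them against the Solr version in use. The sketch also assumes a_table is small enough to hold in memory.

```xml
<!-- Sketch only: a possible replacement for the db-data sub-entity. -->
<!-- The query is no longer parameterised per file; rows are cached   -->
<!-- and looked up by the parent entity's file name instead.          -->
<entity
    name="db-data"
    dataSource="cr-db"
    processor="CachedSqlEntityProcessor"
    query="SELECT id, b FROM a_table"
    cacheKey="id"
    cacheLookup="f.file">
    <field column="B" name="b" />
</entity>
```

This would not stop DIH from invoking the sub-entity repeatedly, but the repeated invocations would hit the cache rather than the database.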

You may also want to track SOLR-2943, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.
