Solr: FileListEntityProcessor is executing sub-entities multiple times
Question
I have configured a dih-import.xml as shown below. The FileListEntityProcessor walks through a set of folders and, for each file it finds, executes an XPath entity and a DB entity.
When I ran a full import of ~30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.
Google didn't turn up anything on this subject, so I am hoping you can help. Thanks in advance.
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource
    name="cr-db"
    jndiName="xyz"
    type="JdbcDataSource" />
  <dataSource
    name="cr-xml"
    type="FileDataSource"
    encoding="utf-8" />
  <document name="doc">
    <entity
      dataSource="cr-xml"
      name="f"
      processor="FileListEntityProcessor"
      baseDir="/path/to/xml"
      filename="*.xml"
      recursive="true"
      rootEntity="true"
      onError="skip">
      <entity
        name="xml-data"
        dataSource="cr-xml"
        processor="XPathEntityProcessor"
        forEach="/root"
        url="${f.fileAbsolutePath}"
        transformer="DateFormatTransformer"
        onError="skip">
        <field column="id" xpath="/root/id" />
        <field column="A" xpath="/root/a" />
      </entity>
      <entity
        name="db-data"
        dataSource="cr-db"
        query="
          SELECT
            id, b
          FROM
            a_table
          WHERE
            id = '${f.file}'">
        <field column="B" name="b" />
      </entity>
    </entity>
  </document>
</dataConfig>
EDIT: I found the same problem reported via Google, but there is no answer there either: http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html
Another edit: I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are no longer called 2n times (2, 4, 6, 8, ...) but only n times.
Recommended answer
If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching to CachedSqlEntityProcessor.
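As a rough sketch of what that switch might look like for the db-data entity in the question: the query drops its per-file WHERE clause, so the table is read once and each file is then matched against the cache instead of triggering a new query. The cacheKey and cacheLookup attribute names are given from memory of the DataImportHandler documentation, and older releases describe an equivalent where="id=f.file" form, so treat them as assumptions to check against your Solr version.

```xml
<!-- Sketch only: would replace the db-data entity nested inside the "f" entity. -->
<!-- Assumption: cacheKey/cacheLookup are supported by the Solr version in use. -->
<entity
  name="db-data"
  dataSource="cr-db"
  processor="CachedSqlEntityProcessor"
  query="SELECT id, b FROM a_table"
  cacheKey="id"
  cacheLookup="f.file">
  <field column="B" name="b" />
</entity>
```

Note that this trades database round trips for memory, since the rows returned by the query are held in the cache for the duration of the import. That may or may not be acceptable depending on the size of a_table.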
You may also want to track SOLR-2943, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.