Need help indexing XML files into Solr using DataImportHandler
Problem description
I don't know Java, I don't know XML, and I don't know Lucene. Now that that's out of the way: I have been working on a little project using Apache Solr/Lucene. My problem is that I am unable to index the XML files. I think I understand how it's supposed to work, but I could be wrong. I am not sure what information you need in order to help me, so I will just post the code.
<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <!-- This first entity block will read all xml files in baseDir and feed it into the second entity block for handling. -->
        <entity name="AMMFdir" rootEntity="false" dataSource="null"
                processor="FileListEntityProcessor"
                fileName="^*.xml$" recursive="true"
                baseDir="C:\Documents and Settings\saperez\Desktop\Tomcat\apache-tomcat-7.0.23\webapps\solr\data\AMMF_New"
        >
            <entity
                processor="XPathEntityProcessor"
                name="AMMF"
                pk="AcquirerBID"
                datasource="AMMFdir"
                url="${AMMFdir.fileAbsolutePath}"
                forEach="/AMMF/Merchants/Merchant/"
                transformer="DateFormatTransformer, RegexTransformer"
            >
                <field column="AcquirerBID" xpath="/AMMF/Merchants/Merchant/AcquirerBID" />
                <field column="AcquirerName" xpath="/AMMF/Merchants/Merchant/AcquirerName" />
                <field column="AcquirerMerchantID" xpath="/AMMF/Merchants/Merchant/AcquirerMerchantID" />
            </entity>
        </entity>
    </document>
</dataConfig>
Example XML file:
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029881</AcquirerBID>
<AcquirerName>WorldPay</AcquirerName>
<AcquirerMerchantID>*</AcquirerMerchantID>
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029882</AcquirerBID>
<AcquirerName>WorldPay2</AcquirerName>
<AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>
I have this in the schema:
<field name="AcquirerBID" type="string" indexed="true" stored="true" required="true" />
<field name="AcquirerName" type="string" indexed="true" stored="true" />
<field name="AcquirerMerchantID" type="string" indexed="true" stored="true"/>
I have this in the config:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" default="true">
    <lst name="defaults">
        <str name="config">AMMFconfig.xml</str>
    </lst>
</requestHandler>
The sample XML is not well formed. This might explain errors indexing the files:
$ xmllint sample.xml
sample.xml:13: parser error : expected '>'
</Merchants>
^
sample.xml:14: parser error : Premature end of data in tag Merchants line 3
sample.xml:14: parser error : Premature end of data in tag AMMF line 2
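If xmllint is not installed (the paths in the question suggest a Windows machine), roughly the same well-formedness check can be sketched with the XML parser that ships with the JDK. This is only an illustration: the class `WellFormedCheck` and its `firstError` methods are names made up for this example, not part of Solr, and the check covers well-formedness only, not validity against the XSD.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of Solr): reports whether XML is well formed.
class WellFormedCheck {

    // Returns null if the input parses, otherwise a short description of the first error.
    static String firstError(InputSource source) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(source);
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    // Convenience overload for XML held in a string.
    static String firstError(String xml) {
        return firstError(new InputSource(new StringReader(xml)));
    }

    // Usage: java WellFormedCheck sample.xml other.xml
    public static void main(String[] args) {
        for (String path : args) {
            String error = firstError(new InputSource(new File(path).toURI().toString()));
            System.out.println(path + ": " + (error == null ? "well formed" : error));
        }
    }
}
```

Running it over every file in the import directory before starting an import should show which files the parser would reject.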
Corrected XML
Here's what I think your sample data should look like (I didn't check it against the XSD file):
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
    <Merchants Count="153">
        <Merchant ChangeIndicator="A" LocationCountry="840">
            <AcquirerBID>10029881</AcquirerBID>
            <AcquirerName>WorldPay</AcquirerName>
            <AcquirerMerchantID>*</AcquirerMerchantID>
        </Merchant>
        <Merchant ChangeIndicator="A" LocationCountry="840">
            <AcquirerBID>10029882</AcquirerBID>
            <AcquirerName>WorldPay2</AcquirerName>
            <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
        </Merchant>
    </Merchants>
</AMMF>
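To confirm that a file of this shape yields one record per Merchant, with the three fields your DIH config maps, a dry run outside Solr can help. The following is a minimal sketch using only the JDK's DOM parser. `MerchantPreview` and its methods are hypothetical names for this example, and it is not how DIH reads the file internally; it only shows which values are present in the XML.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical dry-run helper (not part of Solr): lists the values found per Merchant.
class MerchantPreview {

    // Returns one "bid|name|merchantId" string per <Merchant> element.
    static List<String> merchants(InputSource source) {
        try {
            // The factory is not namespace aware by default, so the default
            // xmlns on <AMMF> does not get in the way of lookups by tag name.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
            NodeList nodes = doc.getElementsByTagName("Merchant");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element merchant = (Element) nodes.item(i);
                rows.add(text(merchant, "AcquirerBID") + "|"
                        + text(merchant, "AcquirerName") + "|"
                        + text(merchant, "AcquirerMerchantID"));
            }
            return rows;
        } catch (Exception e) {
            // Malformed XML or an unreadable file ends up here.
            throw new IllegalStateException("Could not read merchants: " + e.getMessage(), e);
        }
    }

    // Convenience overload for XML held in a string.
    static List<String> merchantsFromString(String xml) {
        return merchants(new InputSource(new StringReader(xml)));
    }

    // Text of the first descendant with the given tag, or "" if there is none.
    static String text(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? "" : found.item(0).getTextContent().trim();
    }

    // Usage: java MerchantPreview sample.xml
    public static void main(String[] args) {
        for (String path : args) {
            for (String row : merchants(new InputSource(new File(path).toURI().toString()))) {
                System.out.println(row);
            }
        }
    }
}
```

If the number of rows printed does not match the number of Merchant elements you expect, the file itself is the place to look before touching the Solr configuration.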
Alternative solution
I know you said you're not a programmer, but this task is significantly simpler if you use the SolrJ interface.
The following is a Groovy example which indexes your example XML:
//
// Dependencies
// ============
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
])

//
// Main
// =====
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/")

// Counter used to give each document a unique id
def i = 1

// Process every file in the current directory whose name ends in .xml
new File(".").eachFileMatch(~/.*\.xml/) {
    it.withReader { reader ->
        def ammf = new XmlSlurper().parse(reader)

        // Create one Solr document per Merchant element
        ammf.Merchants.Merchant.each { merchant ->
            SolrInputDocument doc = new SolrInputDocument()
            doc.addField("id", i++)
            // text() passes plain strings to SolrJ rather than XmlSlurper node objects
            doc.addField("bid_s", merchant.AcquirerBID.text())
            doc.addField("name_s", merchant.AcquirerName.text())
            doc.addField("merchantId_s", merchant.AcquirerMerchantID.text())
            server.add(doc)
        }
    }
}

server.commit()
Groovy is a scripting language for the Java platform that does not require a separate compilation step. A script like this should be about as easy to maintain as a DIH config file.