Need help indexing XML files into Solr using DataImportHandler

Problem description

I don't know Java, I don't know XML, and I don't know Lucene. Now that that's out of the way: I have been working on a little project using Apache Solr/Lucene. My problem is that I am unable to index the XML files. I think I understand how it's supposed to work, but I could be wrong. I am not sure what information you need to help me, so I will just post the code.

<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<!-- This first entity block will read all xml files in baseDir and feed it into the second entity block for handling. -->
<entity name="AMMFdir" rootEntity="false" dataSource="null"
        processor="FileListEntityProcessor"
        fileName="^*.xml$" recursive="true"
        baseDir="C:Documents and SettingssaperezDesktopTomcatapache-tomcat-7.0.23webappssolrdataAMMF_New"
        >
<entity 
        processor="XPathEntityProcessor"
        name="AMMF"
        pk="AcquirerBID"
        datasource="AMMFdir"
        url="${AMMFdir.fileAbsolutePath}"
        forEach="/AMMF/Merchants/Merchant/"
        transformer="DateFormatTransformer, RegexTransformer"
        >

    <field column="AcquirerBID" xpath="/AMMF/Merchants/Merchant/AcquirerBID" />
    <field column="AcquirerName" xpath="/AMMF/Merchants/Merchant/AcquirerName" />
    <field column="AcquirerMerchantID" xpath="/AMMF/Merchants/Merchant/AcquirerMerchantID" />

</entity>
</entity>
</document>

Example XML file

<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
    <AcquirerBID>10029881</AcquirerBID>
    <AcquirerName>WorldPay</AcquirerName>
    <AcquirerMerchantID>*</AcquirerMerchantID>
    <Merchant ChangeIndicator="A" LocationCountry="840">
    <AcquirerBID>10029882</AcquirerBID>
    <AcquirerName>WorldPay2</AcquirerName>
    <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>

I have this in the schema.

<field name="AcquirerBID" type="string" indexed="true" stored="true" required="true" /> 
<field name="AcquirerName" type="string" indexed="true" stored="true" />
<field name="AcquirerMerchantID" type="string" indexed="true" stored="true"/>

I have this in the config.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" default="true" >
<lst name="defaults">
<str name="config">AMMFconfig.xml</str>
</lst>
</requestHandler>

Solution

The sample XML is not well formed: the first <Merchant> element is never closed, and the closing </AMMF> tag is missing. This might explain the errors when indexing the files:

$ xmllint sample.xml
sample.xml:13: parser error : expected '>'
</Merchants>
          ^
sample.xml:14: parser error : Premature end of data in tag Merchants line 3
sample.xml:14: parser error : Premature end of data in tag AMMF line 2
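
If xmllint is not available, the same kind of check can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not part of the original setup: the function name `check_well_formed` and the two tiny inline documents are made up for illustration, and the `broken` one only imitates the mistake in the sample (a `<Merchant>` that is never closed).

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# Hypothetical snippets: "broken" never closes <Merchant>, "fixed" does.
broken = "<AMMF><Merchants><Merchant><AcquirerBID>1</AcquirerBID></Merchants></AMMF>"
fixed = "<AMMF><Merchants><Merchant><AcquirerBID>1</AcquirerBID></Merchant></Merchants></AMMF>"

print(check_well_formed(broken))  # a tuple describing the first parse error
print(check_well_formed(fixed))   # None, because the document parses
```

Running a check like this over every file before starting an import should make it easier to tell a data problem apart from a configuration problem.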

Corrected XML

Here's what I think your sample data should look like (I didn't check it against the XSD file):

<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
  <Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029881</AcquirerBID>
      <AcquirerName>WorldPay</AcquirerName>
      <AcquirerMerchantID>*</AcquirerMerchantID>
    </Merchant>
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029882</AcquirerBID>
      <AcquirerName>WorldPay2</AcquirerName>
      <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
    </Merchant>
  </Merchants>
</AMMF>
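
To confirm that the corrected file really yields one record per Merchant with the three fields you want to index, something like the following Python sketch may help. It uses only the standard library; the names `extract_merchants`, `NS` and `SAMPLE` are made up for illustration, and `SAMPLE` is the corrected data above with the XML declaration and most of the root attributes trimmed. Note that the elements live in the default namespace declared on `<AMMF>`, so this parser needs the namespace mapping to find them.

```python
import xml.etree.ElementTree as ET

# Prefix "a" is arbitrary; the URI is the xmlns value from the sample file.
NS = {"a": "http://tempuri.org/XMLSchema.xsd"}

# Corrected sample data, trimmed for brevity.
SAMPLE = """<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2">
  <Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029881</AcquirerBID>
      <AcquirerName>WorldPay</AcquirerName>
      <AcquirerMerchantID>*</AcquirerMerchantID>
    </Merchant>
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029882</AcquirerBID>
      <AcquirerName>WorldPay2</AcquirerName>
      <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
    </Merchant>
  </Merchants>
</AMMF>"""

FIELDS = ("AcquirerBID", "AcquirerName", "AcquirerMerchantID")


def extract_merchants(xml_text):
    """Return one dict per <Merchant>, keyed by the three field names."""
    root = ET.fromstring(xml_text)
    rows = []
    for merchant in root.findall("a:Merchants/a:Merchant", NS):
        rows.append({name: merchant.findtext("a:" + name, namespaces=NS)
                     for name in FIELDS})
    return rows


for row in extract_merchants(SAMPLE):
    print(row)
```

With the corrected sample this prints two records, one for each Merchant element.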

Alternative solution

I know you said you're not a programmer, but this task is significantly simpler if you use the SolrJ interface.

The following is a Groovy example which indexes your example XML:

//
// Dependencies
// ============
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
])

//
// Main
// =====

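// Connect to the local Solr instance (adjust the URL to match your setup)
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/")
def i = 1

// Process every file in the current directory whose name ends in .xml
new File(".").eachFileMatch(~/.*\.xml/) { file ->

    file.withReader { reader ->
        def ammf = new XmlSlurper().parse(reader)

        // Create one Solr document per <Merchant> element
        ammf.Merchants.Merchant.each { merchant ->
            SolrInputDocument doc = new SolrInputDocument()

            doc.addField("id",           i++)
            doc.addField("bid_s",        merchant.AcquirerBID.text())
            doc.addField("name_s",       merchant.AcquirerName.text())
            doc.addField("merchantId_s", merchant.AcquirerMerchantID.text())

            server.add(doc)
        }
    }

}

// Make the newly added documents searchable
server.commit()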

Groovy is a scripting language for the Java platform that can be run without a separate compilation step. A script like this should be just as easy to maintain as a DIH config file.
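
For comparison, here is a rough Python sketch of what the script above sends to Solr, expressed as a Solr XML update message (`<add><doc><field name="...">`). It only builds the message and does not contact a server. The function name `build_add_message` and the input records are made up for illustration; the field names (`id`, `bid_s`, `name_s`, `merchantId_s`) are the ones used in the Groovy script, and whether they exist in your schema is an assumption you would need to check.

```python
import xml.etree.ElementTree as ET


def build_add_message(merchants):
    """Build a Solr XML <add> message with one <doc> per merchant record."""
    add = ET.Element("add")
    for i, merchant in enumerate(merchants, start=1):
        doc = ET.SubElement(add, "doc")
        values = (
            ("id", str(i)),  # running counter, as in the Groovy script
            ("bid_s", merchant["AcquirerBID"]),
            ("name_s", merchant["AcquirerName"]),
            ("merchantId_s", merchant["AcquirerMerchantID"]),
        )
        for name, value in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Records taken from the corrected sample data.
records = [
    {"AcquirerBID": "10029881", "AcquirerName": "WorldPay", "AcquirerMerchantID": "*"},
    {"AcquirerBID": "10029882", "AcquirerName": "WorldPay2", "AcquirerMerchantID": "Hello World!"},
]
print(build_add_message(records))
```

Sending the message to Solr's update handler and committing is deliberately left out, since the URL and handler configuration depend on your installation.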
