Solr DataImportHandler is not indexing all data defined

Problem description

I am using Solr 5.3.

I am trying to load a Wikipedia page-article dump into Solr using the DataImportHandler, but when I query I get back only the id and title fields.

Below is my data-config.xml:

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/mnt/TEST/enwiki-20150602-pages-articles1.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I have also added the entries below to schema.xml:

    <field name="id"        type="string"  indexed="true" stored="true" required="true" multiValued="false" />
    <field name="title"     type="string"  indexed="true" stored="false"/>
    <field name="revision"  type="int"     indexed="true" stored="true"/>
    <field name="user"      type="string"  indexed="true" stored="true"/>
    <field name="userId"    type="int"     indexed="true" stored="true"/>
    <field name="text"      type="text_en" indexed="true" stored="false"/>
    <field name="timestamp" type="date"    indexed="true" stored="true"/>
    <field name="titleText" type="text_en" indexed="true" stored="true"/>

I copied schema.xml from "example/example-DIH/solr/solr/conf/schema.xml" and removed all field entries, with a few exceptions as mentioned in the comments.

After importing the data I try to fetch all fields, but I get only "id" and "title".

I also tried to run the document import in debug mode to get some information about indexing, but whenever I select debug mode it imports only 2 documents, and I am not sure why. Because of this I am not able to debug the indexing process.

Please guide me further.

EDIT: I am now sure that the other fields are not getting indexed, because when I specify df=user or df=text I get the message below.

"msg": "undefined field user",

I am querying like this: http://localhost:8983/solr/wiki/select?q=*%3A*&fl=id%2Ctitle%2Ctext%2Crevision&wt=json&indent=true&debugQuery=true

Recommended answer

The settings above work only with the classic schema (schema.xml). In solrconfig.xml, however, the managed schema was enabled by default, so the fields I had declared in schema.xml were not in effect and I was not getting the text. With the managed schema I do not need to define "schema.xml"; instead I name the fields in data-config.xml so that they match the schema's dynamic fields (the "_s" suffix), like below.

            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title_s"   xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user_s"    xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text_s"    xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <!-- sourceColName must name the renamed column (text_s), otherwise redirects are not skipped -->
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text_s"/>
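
For reference, the difference between the two schema modes comes down to the schemaFactory element in solrconfig.xml. The fragments below are a minimal sketch, not the asker's actual files: the two schemaFactory variants are the standard ones, while the exact attributes of the "*_s" dynamic field are an assumption based on the stock config sets and may differ in your core.

```xml
<!-- solrconfig.xml, variant 1: managed schema (the default the answer refers to).
     Solr reads the schema from the "managed-schema" resource. -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<!-- solrconfig.xml, variant 2: classic schema. Use this INSTEAD of variant 1
     (only one schemaFactory may be present) if the fields should come from schema.xml. -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

<!-- managed-schema: dynamic field that names such as title_s, user_s and text_s match.
     Assumed definition; check the file in your own config set. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```

With variant 2 the original column names (title, user, text) and the schema.xml entries from the question would apply; with variant 1 the "_s" names rely on the dynamic field rule. The core has to be reloaded after changing solrconfig.xml.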
