How to extract meta tags from HTML files and index them in Solr with Tika

Question

I am trying to extract the meta tags of HTML files and index them into Solr through the Tika integration. I am not able to extract those meta tags with Tika, and they do not show up in Solr.

My HTML file looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>
</p>

My data-config.xml file looks like this:

<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
    <document>   
    <entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/path/to/html/files/" 
        fileName=".*html|xml" onError="skip"
        recursive="false">

        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size"/>
        <field column="file" name="filename"/>

        <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" 
        url="${f.fileAbsolutePath}" format="text" onError="skip">

        <field column="product_id" name="product_id" meta="true"/>
        <field column="assetid" name="assetid" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="type" name="type" meta="true"/>
        <field column="first" name="first" meta="true"/>
        <field column="category" name="category" meta="true"/>      
        </entity>
    </entity>
</document>
</dataConfig>

In my schema.xml file I have added the following fields:

<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>

In my solrconfig.xml file I have added the following code:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/data-config.xml</str>
  </lst>
</requestHandler>

Does anyone know how to extract these meta tags from the HTML files and index them in Solr with Tika? Any help would be appreciated.

Recommended answer

Which version of Solr are you using? If you are using Solr 4.0 or above, Tika is embedded in it. Tika communicates with Solr through the Solr Cell 'ExtractingRequestHandler' class, which is configured in solrconfig.xml as follows:

      <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

By default, as you can see in the configuration above, any field extracted from the HTML document that is not declared in schema.xml is prefixed with 'ignored_', i.e. it is mapped to the 'ignored_*' dynamic field in schema.xml. The relevant part of the default schema.xml reads as follows:

       <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_pi"  type="pint"    indexed="true"  stored="true"/>
   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

And this is how the 'ignored' field type is defined:

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright.  --> 
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

So, metadata extracted by Tika that does not match a field declared in the schema is by default put into 'ignored_*' fields by Solr Cell, and that is why it is neither indexed nor stored. To index and store the metadata, you can either change uprefix to 'attr_', or create specific fields or a dynamic field for your known metadata and treat them as you want.
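As a rough illustration of the second option, the following schema.xml fragment declares explicit fields for a few of the meta tags. It assumes the meta tag names from the question (product_id, assetid, category) and that Tika reports them under those same names after lowernames is applied; the field types are assumptions as well. Fields declared in the schema are not treated as unknown, so uprefix does not apply to them.

```xml
<!-- schema.xml sketch: explicit fields for known meta tags.
     Names come from the question's HTML; the types are assumptions. -->
<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid"    type="string" indexed="true" stored="true"/>
<field name="category"   type="string" indexed="true" stored="true"/>
```

With fields like these in place, uprefix could stay at 'ignored_' so that only the metadata you care about is kept.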

So, here is the corrected solrconfig.xml, with uprefix changed to 'attr_':

  <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
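
The fmap.* parameters already used in this handler can also move an extracted meta tag into a differently named schema field. The fragment below is only a sketch: 'summary' is a hypothetical field name, not something from the question, and the example assumes the 'first' meta tag from the question's HTML is reported by Tika under that name.

```xml
<!-- solrconfig.xml sketch: "summary" is a hypothetical target field -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <!-- move the value of <meta name="first" .../> into "summary" -->
    <str name="fmap.first">summary</str>
  </lst>
</requestHandler>

<!-- schema.xml sketch: the target field is assumed to be declared -->
<field name="summary" type="text_general" indexed="true" stored="true"/>
```

Mapping with fmap keeps the HTML untouched while letting the schema use more descriptive field names.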
