How to extract meta tags from HTML files and index them in Solr and Tika


Problem description

I am trying to extract the meta tags from HTML files and index them into Solr through the Tika integration. So far Tika does not extract the meta tags, and they do not show up in Solr.

My HTML file looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>
</p>
</p>

My data-config.xml file looks like this:

<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
    <document>   
    <entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/path/to/html/files/" 
        fileName=".*html|xml" onError="skip"
        recursive="false">

        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size"/>
        <field column="file" name="filename"/>

        <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" 
        url="${f.fileAbsolutePath}" format="text" onError="skip">

        <field column="product_id" name="product_id" meta="true"/>
        <field column="assetid" name="assetid" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="type" name="type" meta="true"/>
        <field column="first" name="first" meta="true"/>
        <field column="category" name="category" meta="true"/>      
        </entity>
    </entity>
</document>
</dataConfig>

In my schema.xml file I have added the following fields:

<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>

In my solrconfig.xml file I have added the following:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
<lst name="defaults">
  <str name="config">/path/to/data-config.xml</str>
</lst>

Does anyone know how to extract those meta tags from the HTML files and index them in Solr with Tika? Any help would be appreciated.

Recommended answer

Which version of Solr are you using? If you are using Solr 4.0 or above, Tika is bundled with it. Solr talks to Tika through Solr Cell's ExtractingRequestHandler class, which is configured in solrconfig.xml as follows:

      <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

As the configuration above shows, by default any field extracted from the HTML document that is not declared in schema.xml is prefixed with 'ignored_', so it is mapped to the 'ignored_*' dynamic field in schema.xml. The default schema.xml reads as follows:

       <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_pi"  type="pint"    indexed="true"  stored="true"/>
   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

And this is how the 'ignored' type is defined:

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright.  --> 
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

So by default Solr Cell puts the metadata extracted by Tika into 'ignored_*' fields, which is why it is neither indexed nor stored. To index and store the metadata, either change uprefix to 'attr_' or create specific fields or a dynamic field for the metadata you know about, and handle them as you need.
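To make the renaming concrete, here is a small Python sketch of how a field name extracted by Tika might end up as a Solr field name. It is a simplified model of the lowernames, fmap.* and uprefix parameters, not Solr's actual code: the function name solr_field_name is hypothetical, and the schema lookup only checks explicitly declared fields (real Solr also matches dynamic fields).

```python
def solr_field_name(raw_name, schema_fields, fmap=None, lowernames=True, uprefix=""):
    """Simplified model of Solr Cell field naming (illustrative only)."""
    name = raw_name
    if lowernames:
        # lowernames=true: lowercase, and turn non-alphanumerics into "_"
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    # fmap.<source>=<target>: explicit renames such as fmap.a=links
    name = (fmap or {}).get(name, name)
    # uprefix: applied only when the name is not a declared field
    if name not in schema_fields and uprefix:
        name = uprefix + name
    return name


declared = {"product_id", "title", "links"}  # assumed schema.xml fields

print(solr_field_name("Content-Type", declared, uprefix="ignored_"))
print(solr_field_name("Content-Type", declared, uprefix="attr_"))
print(solr_field_name("product_id", declared, uprefix="ignored_"))
```

Under this model, a declared name such as product_id keeps its name, while an undeclared one lands in 'ignored_*' or 'attr_*' depending on uprefix.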

Here is the corrected solrconfig.xml:

  <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
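To try the handler, one option is to send an HTML file to /update/extract as the body of a POST request. The Python sketch below uses only the standard library. The base URL, the document id, and the helper names build_extract_url and post_html are assumptions for illustration; adjust the URL to your own Solr host and core.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, commit=True):
    """Build the /update/extract URL; literal.id sets the document's id field."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


def post_html(base_url, doc_id, path):
    """Send the raw HTML file as the request body and return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(base_url, doc_id),
        data=body,
        headers={"Content-Type": "text/html"},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status


# Assumed base URL of a local Solr instance; not taken from the question.
print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

Calling post_html("http://localhost:8983/solr", "doc1", "article.html") would send the file (article.html is a placeholder name). After that, a query on id:doc1 should show whether the meta tags arrived as their own fields or as attr_* fields.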
