Apache Solr - Indexing ZIP files


Question

My web app is an e-mail service. It stores email messages in a MySQL database, and the email attachments are stored on disk.

The database looks like this:

----------------------------------------------------------------------
| id | sender | receiver | subject | body | attach_dir | attachments |
----------------------------------------------------------------------
| 2  | 444    | 555      | Apples  | Hey! | /mnt/emails| att1.doc\r\n|
|    |        |          |         |      |            | att2.doc\r\n|
----------------------------------------------------------------------
| 3  | 77     | 22       | Pears   | Hola!| /mnt/emails| att1.zip\r\n|
----------------------------------------------------------------------

I index it with the following data-config.xml:

<dataConfig>
<dataSource name="mysql"
            type="JdbcDataSource" 
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/email?
              useUnicode=true&amp;
              characterEncoding=UTF-8&amp;
              useTimezone=true&amp;
              serverTimezone=UTC"
            user="user" 
            password="pass"/>

<dataSource name="files"
            type="BinFileDataSource" />
<document>
  <entity name="email" dataSource="mysql"
    query="SELECT id, subject, body, date, attach, attach_dir FROM email"
    transformer="RegexTransformer"
   >
     <field column="id" name="id"/>
     <field column="subject" name="subject"/>
     <field column="body" name="content"/>
     <field column="date" name="last_modified"/>
     <field column="attach" name="attach" splitBy="\r\n" />
     <field column="attach_dir" name="attach_dir"/>
     <entity name="attach_glob" dataSource="null" 
     processor="FileListEntityProcessor" 
     baseDir="/mnt/attach/${email.attach_dir}" fileName=".*" 
     recursive="false" onError="skip">
         <entity name="email_attachment" dataSource="files" 
         processor="TikaEntityProcessor" 
         url="${attach_glob.fileAbsolutePath}">
             <field column="text" name="attach_content"/>
         </entity>
     </entity>         
  </entity>
</document>
</dataConfig>

This works well for all files except compressed files such as .zip. For .zip files, the attach_content field is filled only with the file names from the zip archive, not with the content of the files inside it.

However, if I use SimplePostTool like this:

/opt/solr/bin/post -c mycollection /mnt/attach/message3/att1.zip

then I get the content extracted from all the files inside the zip archive, which is what I need. But I need this content to be part of the documents that the Data Import Handler adds using the data-config.xml above.

Is this possible?

Recommended answer

You need to set extractEmbedded to true on the TikaEntityProcessor configuration so that it sets the appropriate Parser in the Apache Tika ParseContext, which lets Tika parse embedded documents.

For example, you can change the configuration from the question to set it as shown below:

 <entity name="email_attachment" dataSource="files" 
     processor="TikaEntityProcessor" 
     url="${attach_glob.fileAbsolutePath}" extractEmbedded="true">
         <field column="text" name="attach_content"/>
  </entity>

See here for more details.
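
To show where the change sits, here is a sketch of the two nested attachment entities from the question with the attribute added. It assumes the rest of data-config.xml stays exactly as posted; the XML comments are explanatory additions and not part of the original configuration.

```xml
<!-- Lists the files in the attachment directory of the current email row -->
<entity name="attach_glob" dataSource="null"
        processor="FileListEntityProcessor"
        baseDir="/mnt/attach/${email.attach_dir}" fileName=".*"
        recursive="false" onError="skip">
    <!-- Runs Tika on each listed file; extractEmbedded="true" makes it
         also parse the documents contained in archives such as .zip -->
    <entity name="email_attachment" dataSource="files"
            processor="TikaEntityProcessor"
            url="${attach_glob.fileAbsolutePath}" extractEmbedded="true">
        <field column="text" name="attach_content"/>
    </entity>
</entity>
```

Attachments that were indexed before the change still hold only the file names, so they would need to be re-imported for the embedded content to appear in attach_content.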
