Apache Solr - Indexing ZIP files

Question

My web app is an e-mail service. It stores e-mail messages in a MySQL database, and the e-mail attachments are stored on disk.

The database looks like this:

----------------------------------------------------------------------
| id | sender | receiver | subject | body | attach_dir | attachments |
----------------------------------------------------------------------
| 2  | 444    | 555      | Apples  | Hey! | /mnt/emails| att1.doc\r\n|
|    |        |          |         |      |            | att2.doc\r\n|
----------------------------------------------------------------------
| 3  | 77     | 22       | Pears   | Hola!| /mnt/emails| att1.zip\r\n|
----------------------------------------------------------------------

I index it with the following data-config.xml:

<dataConfig>
<dataSource name="mysql"
            type="JdbcDataSource" 
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/email?
              useUnicode=true&amp;
              characterEncoding=UTF-8&amp;
              useTimezone=true&amp;
              serverTimezone=UTC"
            user="user" 
            password="pass"/>

<dataSource name="files"
            type="BinFileDataSource" />
<document>
  <entity name="email" dataSource="mysql"
    query="SELECT id, subject, body, date, attach, attach_dir FROM email"
    transformer="RegexTransformer"
   >
     <field column="id" name="id"/>
     <field column="subject" name="subject"/>
     <field column="body" name="content"/>
     <field column="date" name="last_modified"/>
     <field column="attach" name="attach" splitBy="\r\n" />
     <field column="attach_dir" name="attach_dir"/>
     <entity name="attach_glob" dataSource="null" 
     processor="FileListEntityProcessor" 
     baseDir="/mnt/attach/${email.attach_dir}" fileName=".*" 
     recursive="false" onError="skip">
         <entity name="email_attachment" dataSource="files" 
         processor="TikaEntityProcessor" 
         url="${attach_glob.fileAbsolutePath}">
             <field column="text" name="attach_content"/>
         </entity>
     </entity>         
  </entity>
</document>
</dataConfig>

This works well for all files except compressed archives such as .zip. For .zip files, the attach_content field is filled only with the names of the files in the archive, not with the content extracted from those files.

However, if I use SimplePostTool like this:

/opt/solr/bin/post -c mycollection /mnt/attach/message3/att1.zip

then I get the content extracted from every file inside the zip archive, which is what I need. But I need this content to be part of the documents added by the Data Import Handler with the data-config.xml above.

Is this possible?

Recommended answer

You need to set extractEmbedded to true on the TikaEntityProcessor configuration. With that setting, the processor places the appropriate Parser in the Apache Tika ParseContext, so that Tika also parses the documents embedded in the archive.

For example, you can change the email_attachment entity in the configuration from the question to set it like this:

 <entity name="email_attachment" dataSource="files" 
     processor="TikaEntityProcessor" 
     url="${attach_glob.fileAbsolutePath}" extractEmbedded="true">
         <field column="text" name="attach_content"/>
  </entity>

See here for more details.
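To make the difference between the two behaviours concrete, here is a minimal sketch in Python. It does not use Solr or Tika and does not reproduce Tika's actual output format. It uses only the standard zipfile module to contrast listing the entry names of an archive with also reading each entry's content. The function names build_sample_zip and extract_text, the extract_embedded flag (named after the attribute above), and the sample file names and contents are all made up for illustration.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a small in-memory ZIP with two text files (made-up sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("notes.txt", "apples and pears")
        archive.writestr("todo.txt", "reply to message 3")
    return buffer.getvalue()


def extract_text(zip_bytes: bytes, extract_embedded: bool) -> str:
    """Return the text an indexer would see for the archive.

    extract_embedded=False: only the entry names are emitted.
    extract_embedded=True: each entry's content is emitted after its name.
    """
    parts = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            parts.append(name)
            if extract_embedded:
                # Hypothetical simplification: treat every entry as UTF-8 text.
                parts.append(archive.read(name).decode("utf-8"))
    return "\n".join(parts)


if __name__ == "__main__":
    data = build_sample_zip()
    print(extract_text(data, extract_embedded=False))
    print("---")
    print(extract_text(data, extract_embedded=True))
```

With the flag off, only the file names are returned, which corresponds to what the question describes for attach_content. With the flag on, the contents of the entries are included as well.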
