Apache Solr - Indexing ZIP files
Problem description
My web app is an e-mail service. It stores email messages in a MySQL database and email attachments on disk.
The database looks something like this:
----------------------------------------------------------------------
| id | sender | receiver | subject | body | attach_dir | attachments |
----------------------------------------------------------------------
| 2 | 444 | 555 | Apples | Hey! | /mnt/emails| att1.doc\r\n|
| | | | | | | att2.doc\r\n|
----------------------------------------------------------------------
| 3 | 77 | 22 | Pears | Hola!| /mnt/emails| att1.zip\r\n|
----------------------------------------------------------------------
I index it with the following data-config.xml:
<dataConfig>
  <dataSource name="mysql"
              type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/email?useUnicode=true&amp;characterEncoding=UTF-8&amp;useTimezone=true&amp;serverTimezone=UTC"
              user="user"
              password="pass"/>
  <dataSource name="files"
              type="BinFileDataSource"/>
  <document>
    <entity name="email" dataSource="mysql"
            query="SELECT id, subject, body, date, attach, attach_dir FROM email"
            transformer="RegexTransformer">
      <field column="id" name="id"/>
      <field column="subject" name="subject"/>
      <field column="body" name="content"/>
      <field column="date" name="last_modified"/>
      <field column="attach" name="attach" splitBy="\r\n"/>
      <field column="attach_dir" name="attach_dir"/>
      <entity name="attach_glob" dataSource="null"
              processor="FileListEntityProcessor"
              baseDir="/mnt/attach/${email.attach_dir}" fileName=".*"
              recursive="false" onError="skip">
        <entity name="email_attachment" dataSource="files"
                processor="TikaEntityProcessor"
                url="${attach_glob.fileAbsolutePath}">
          <field column="text" name="attach_content"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
This works well for all files except compressed files such as .zip. For .zip files, the attach_content field is filled only with the file names from the zip archive, not with the content of the files extracted from it.
However, if I use SimplePostTool like this:
/opt/solr/bin/post -c mycollection /mnt/attach/message3/att1.zip
then I get the content extracted from all the files inside the zip archive, which is what I need. However, I need this content to be part of the documents added by the Data Import Handler with the data-config.xml above.
Is this possible?
Recommended answer
You need to set extractEmbedded to true on the TikaEntityProcessor configuration. With that flag set, the processor registers the appropriate Parser in the Apache Tika ParseContext, which allows Tika to parse embedded documents.
For example, you can change the configuration from the question to set it as follows:
<entity name="email_attachment" dataSource="files"
        processor="TikaEntityProcessor"
        url="${attach_glob.fileAbsolutePath}" extractEmbedded="true">
  <field column="text" name="attach_content"/>
</entity>
See here for more details.
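To make the difference between the two behaviours concrete, here is a minimal Python sketch. It is only a conceptual analogy built on the standard zipfile module, not the actual code of Tika or Solr; the function names, the extract_embedded parameter, and the sample file names are all hypothetical and chosen for illustration.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Build a small in-memory zip (a hypothetical stand-in for att1.zip)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("notes.txt", "apples and pears")
        zf.writestr("todo.txt", "reply to sender 77")
    return buf.getvalue()


def extract_text(data: bytes, extract_embedded: bool) -> str:
    """Return the text an indexer would see for a zip attachment.

    With extract_embedded=False only the entry names are emitted.
    With extract_embedded=True the content of each entry is emitted too.
    """
    parts = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # The container itself always contributes the entry name.
            parts.append(info.filename)
            if extract_embedded:
                # Descend into the entry and append its decoded content.
                parts.append(zf.read(info).decode("utf-8"))
    return "\n".join(parts)


sample = build_sample_zip()
names_only = extract_text(sample, extract_embedded=False)
with_content = extract_text(sample, extract_embedded=True)
```

Here names_only holds just the two entry names, which mirrors what the question observed in attach_content, while with_content also holds the text stored inside each entry, which is the result the asker wants from the Data Import Handler.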