How to index text files using Apache Solr

Question

I want to index text files. After a lot of searching I came across Apache Tika. On some of the sites where I read about it, I learned that Tika converts the text into XML format and then sends it to Solr, but while converting it creates only one tag, for example ... The text file I want to index is a Tomcat localhost access log, and it is gigabytes in size. I cannot store it as a single index entry. I want each line to have its own line ID, so that I can easily retrieve the matching line.

Can this be done in Apache Tika?

Recommended answer

Solr with Tika supports extracting data from many file formats.
The complete list of supported file formats can be found at the linked page.

You can provide any of these file formats as input. Tika will autodetect the format, extract the text from the file, and pass it to Solr for indexing.

Edit:
Tika does not convert the text file to XML before sending it to Solr. It just extracts the metadata and the content of the file and populates fields in Solr according to the defined mapping.
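
To make the field mapping concrete, here is a minimal Python sketch that builds a request URL for Solr's extracting request handler (/update/extract). The parameters literal.id, fmap.content and commit are handler parameters. The core name, the document ID and the target field text_t are assumptions chosen for illustration, and the sketch assumes the extraction module is enabled on your Solr instance.

```python
from urllib.parse import urlencode


def extract_url(solr_base: str, core: str, doc_id: str,
                content_field: str = "text_t") -> str:
    """Build a URL for Solr's /update/extract handler.

    The file body would be POSTed to this URL; Tika then extracts the
    content and metadata on the Solr side.
    """
    params = {
        # Set the document's unique key explicitly.
        "literal.id": doc_id,
        # Map Tika's extracted "content" onto a Solr field.
        # "text_t" is an assumed field name, not taken from the answer.
        "fmap.content": content_field,
        # Make the document searchable right after the request.
        "commit": "true",
    }
    return f"{solr_base}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name.
    print(extract_url("http://localhost:8983/solr", "logs", "access.log"))
```

Posting the access log to a URL like this indexes the whole file as one document, which is the single-document case described in this answer.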

You either have to feed the entire file to Solr as input, in which case it is indexed as a single document, or read the file line by line and send each line to Solr as a separate document.
Solr and Tika will not handle this for you.
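
The line-by-line option could look roughly like the Python sketch below. It is not part of the original answer. It streams the file so a multi-gigabyte log never has to fit in memory, gives each line an ID of the form "<source>:<line number>", and posts the documents in batches to Solr's JSON update endpoint. The field names line_no_i and text_t, the batch size, and the URL in the usage example are assumptions; adjust them to your schema and core.

```python
import json
import urllib.request
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def line_documents(lines: Iterable[str], source_name: str) -> Iterator[Dict]:
    """Turn each non-empty line into one Solr document with its own ID."""
    for line_no, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines but keep the original numbering
        yield {
            "id": f"{source_name}:{line_no}",  # unique per line
            "line_no_i": line_no,              # assumed integer field
            "text_t": text,                    # assumed text field
        }


def batches(items: Iterable, size: int) -> Iterator[List]:
    """Group an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(update_url: str, batch: List[Dict]) -> None:
    """POST one batch of documents as a JSON array to Solr's update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def index_file(path: str, update_url: str, batch_size: int = 1000) -> None:
    """Stream a large text file into Solr, one document per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for batch in batches(line_documents(handle, path), batch_size):
            post_batch(update_url, batch)


if __name__ == "__main__":
    # Hypothetical file name, host and core name.
    index_file(
        "localhost_access_log.txt",
        "http://localhost:8983/solr/access_logs/update?commitWithin=10000",
    )
```

Because every document carries the line number, a search hit maps directly back to the matching line in the original file.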
