How to index text files using Apache Solr


Question

I want to index text files. After a lot of searching I came across Apache Tika. On some of the sites where I read about Apache Tika, I learned that it converts the text into XML format and then sends it to Solr. But during the conversion it creates only one tag, for example ....... The text file I want to index is a Tomcat localhost access log file. The file is several GB in size, so I cannot store it as a single index entry. I want each line to have its own line-id ....... so that I can easily retrieve the matching line.

Can this be done in Apache Tika?

Recommended answer

Solr with Tika supports extraction of data from multiple file formats.
The complete list of supported file formats can be found @ link

You can provide any of the above file formats as input. Tika autodetects the file format, extracts the text from the file, and hands it to Solr for indexing.

Edit:
Tika does not convert the text file to XML before sending it to Solr. Tika just extracts the metadata and the content of the file and populates fields in Solr according to the mapping you define.
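To illustrate the field mapping mentioned above, here is a minimal sketch of sending one whole file to Solr's extracting request handler (Solr Cell, which wraps Tika). The handler path `/update/extract` is the conventional one but depends on your solrconfig.xml. The core name `logs`, the target field `text`, and the helper names are assumptions made for illustration only.

```python
import urllib.parse
import urllib.request


def extract_url(base_url, doc_id, content_field="text"):
    """Build the request URL for Solr's extracting request handler."""
    params = {
        "literal.id": doc_id,           # sets the document's id field directly
        "fmap.content": content_field,  # maps Tika's extracted body to a Solr field (assumed name)
        "commit": "true",               # make the document searchable right away
    }
    return f"{base_url}/update/extract?{urllib.parse.urlencode(params)}"


def post_file(url, path, content_type="text/plain"):
    """Send the raw file body; Solr passes it to Tika for extraction."""
    with open(path, "rb") as handle:
        body = handle.read()  # reads the whole file, so only suitable for small files
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `post_file(extract_url("http://localhost:8983/solr/logs", "access-log"), "access.log")` would index the entire file as one document, which is exactly the behaviour the question wants to avoid for a multi-GB log.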

You either have to feed the entire file to Solr as input, in which case it is indexed as a single document, or you have to read the file line by line and send each line to Solr as a separate document.
Solr and Tika will not do this splitting for you.
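The line-by-line option can be sketched roughly as follows. This is not a Tika feature, just a small client script that streams the file, gives every line its own id, and posts batches to Solr's JSON update handler. The field names `line_no_i` and `line_t` assume the dynamic fields of Solr's default schema, and the function names and batch size are hypothetical.

```python
import itertools
import json
import urllib.request


def line_documents(lines, source_name):
    """Yield one Solr document per non-empty line, keyed by line number."""
    for line_no, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines but keep original line numbering
        yield {
            "id": f"{source_name}:{line_no}",  # unique key: source name + line number
            "line_no_i": line_no,              # assumed dynamic int field
            "line_t": text,                    # assumed dynamic text field
        }


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def post_batch(update_url, batch):
    """POST a list of documents as a JSON array to the update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_file(path, update_url, batch_size=1000):
    """Stream a large file into Solr without loading it all into memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for batch in batched(line_documents(handle, path), batch_size):
            post_batch(update_url, batch)
```

A call such as `index_file("access.log", "http://localhost:8983/solr/logs/update")` would then create one document per log line. The documents only become searchable after a commit, so either add a commit parameter to the update URL or rely on the core's autoCommit settings.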
