How do I index PDF files and search for keywords?


Question

I have a few hundred PDFs. They have no proper structure and no particular fields. All they contain is a lot of text.

What I want to do:

Index the PDFs and search for keywords against the index. I want to know whether a particular keyword occurs in a PDF document and, if it does, I want the line where the keyword is found. If I search for 'Google' in a PDF that contains that term, I would like to see 'Google is a great search engine', which is the line in the PDF.

How I am deciding:

I plan to use either SOLR or Whoosh, but SOLR looks good because of its built-in PDF support. I prefer to code in Python, and Sunburst is a SOLR wrapper that I like. SOLR's sample project ships with a schema file based on price comparison, so I am not sure whether SOLR can solve my problem.

What do you suggest? Any input is much appreciated.

Recommended answer

I think Solr fits your needs.

The "Highlighting" feature is what you are looking for. For it to work, you have to index the documents and also store them in the Lucene index.
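As a rough illustration of the indexing step, here is a minimal sketch that posts a PDF to Solr's extracting request handler. It assumes a Solr instance at localhost:8983 with a core named `pdfs` (a hypothetical name), that the handler is enabled at `/update/extract`, and that the field receiving the extracted text is configured as stored. The helper names `build_extract_url` and `index_pdf` are made up for this example.

```python
import urllib.parse
import urllib.request

# Assumption: hypothetical host and core name; adjust to your setup.
SOLR_CORE_URL = "http://localhost:8983/solr/pdfs"


def build_extract_url(doc_id, core_url=SOLR_CORE_URL):
    """Build the URL for posting one file to the extracting request handler."""
    params = urllib.parse.urlencode({
        "literal.id": doc_id,  # value for the document's unique key field
        "commit": "true",      # commit so the document is searchable right away
    })
    return f"{core_url}/update/extract?{params}"


def index_pdf(path, doc_id, core_url=SOLR_CORE_URL):
    """POST the raw PDF bytes to Solr. Requires a running Solr instance."""
    with open(path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        build_extract_url(doc_id, core_url),
        data=data,
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `index_pdf("report.pdf", "report-1")` once per file would index the whole collection. Committing on every request is convenient for a few hundred files but would be slow for much larger collections.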

The highlighting feature returns a snippet in which the searched text is marked.
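A query with highlighting turned on could look roughly like the sketch below. It assumes the same hypothetical core `pdfs` on localhost:8983 and a stored text field named `content` (also an assumption; use whatever field your schema maps the extracted text to). The parameters `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are standard Solr highlighting parameters; the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: hypothetical core URL and stored text field name.
SOLR_CORE_URL = "http://localhost:8983/solr/pdfs"
TEXT_FIELD = "content"


def build_highlight_url(keyword, core_url=SOLR_CORE_URL, field=TEXT_FIELD):
    """Build a select URL that asks Solr for highlighted snippets."""
    params = urllib.parse.urlencode({
        "q": f"{field}:{keyword}",  # single-term query on the text field
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # field to take snippets from
        "hl.snippets": 5,           # up to 5 snippets per document
        "hl.fragsize": 100,         # approximate snippet length in characters
        "wt": "json",
    })
    return f"{core_url}/select?{params}"


def extract_snippets(response, field=TEXT_FIELD):
    """Map each document id to its list of highlighted snippets."""
    highlighting = response.get("highlighting", {})
    return {doc_id: fields.get(field, []) for doc_id, fields in highlighting.items()}


def search(keyword):
    """Run the query against Solr. Requires a running Solr instance."""
    with urllib.request.urlopen(build_highlight_url(keyword)) as response:
        return extract_snippets(json.load(response))
```

With default settings Solr wraps each match in `<em>` tags, so a snippet might read `<em>Google</em> is a great search engine`. Note that snippets are text fragments of roughly `hl.fragsize` characters, so they will not necessarily match a line of the original PDF exactly.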

Have a look at this: http://wiki.apache.org/solr/HighlightingParameters
