How do I index PDF files and search for keywords?

Views: 28 | Published: 2021/12/30 8:16:27 | Tags: python pdf indexing solr


Problem description

What I have is a collection of PDFs (a few hundred). They have no proper structure and no particular fields; all they contain is a lot of text.

What I want to do:

Index the PDFs and search the index for keywords. I want to know whether a particular keyword occurs in a PDF document and, if it does, I want the line where the keyword was found. For example, if I search for 'Google' in a PDF that contains that term, I would like to see 'Google is a great search engine', which is the line in the PDF.
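To make the requirement concrete, here is a minimal sketch in plain Python of a line-level inverted index. It assumes the text of each PDF has already been extracted to a string by some other tool (extraction is not shown), and that the extractor preserves line breaks. The names `build_index`, `search`, and the sample file names are hypothetical and are not part of SOLR, Whoosh, or Sunburst.

```python
import re
from collections import defaultdict

# Assumption: a "word" is any run of letters, digits, or underscores.
TOKEN_RE = re.compile(r"\w+")


def build_index(docs):
    """Build a keyword index from {document name: extracted plain text}.

    Returns (index, lines) where index maps a lowercased token to the set of
    (document name, line number) pairs containing it, and lines maps each
    such pair back to the text of that line.
    """
    index = defaultdict(set)
    lines = {}
    for name, text in docs.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            stripped = line.strip()
            if not stripped:
                continue  # skip blank lines
            lines[(name, line_no)] = stripped
            for token in TOKEN_RE.findall(stripped.lower()):
                index[token].add((name, line_no))
    return index, lines


def search(index, lines, keyword):
    """Return (document name, line number, line text) for each matching line."""
    hits = sorted(index.get(keyword.lower(), ()))
    return [(name, line_no, lines[(name, line_no)]) for name, line_no in hits]


# Hypothetical sample data standing in for text extracted from two PDFs.
docs = {
    "a.pdf": "Intro\nGoogle is a great search engine\nOther text",
    "b.pdf": "Nothing relevant here",
}
index, lines = build_index(docs)
results = search(index, lines, "Google")
for name, line_no, text in results:
    print(f"{name}:{line_no}: {text}")
```

This only handles single-word, case-insensitive lookups held in memory. A search server such as SOLR or a library such as Whoosh would replace the dictionary with a persistent index, but the shape of the result (document, line, matching text) is what the question is asking for.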

How I am deciding:

I will use either SOLR or Whoosh, but SOLR looks good because of its built-in PDF support. I prefer to code in Python, and Sunburst is a wrapper around SOLR that I like. SOLR's sample/example project ships with a schema file based on price comparison. Now I am not sure whether I can use SOLR to solve my problem.

What do you suggest? Any input is much appreciated.
