Searching (extracting text) PDF files with Algolia

Question

This is just a speculative idea for a client who has a lot of PDF files.

Algolia says in its FAQ that to search PDF files you first need to extract the text from the file. How would you go about this?

The way I envisage the system working is:

  • The client uploads a PDF via the CMS
  • The CMS calls some service/program to extract the text
  • Algolia indexes the extracted text and somehow links it to the original PDF

It would need to be an automated system, as the client shouldn't have to tell it to index. It would be built in PHP, probably Laravel running on Ubuntu.

What software or service could do the text extraction from the PDFs, and is any magic needed to 'link' this with the PDF file?

I'm also happy to have suggestions for other search services that may handle this.

Recommended answer

Fortunately, text extraction from PDFs is a subject that has been covered many times. On the command line, you could use pdftotext (available on Linux and Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
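As a rough illustration of the command-line route, here is a minimal sketch that shells out to pdftotext and captures the text from stdout. It is written in Python purely for illustration (the question targets PHP, where the same command could be invoked in a similar way); the function names are hypothetical, and it assumes pdftotext is installed (on Ubuntu it ships in the poppler-utils package).

```python
from __future__ import annotations

import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Return the argv for pdftotext; the trailing '-' sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on one file and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        # Assumption: the binary is expected on PATH, e.g. from poppler-utils on Ubuntu.
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

In an automated setup, something like `extract_text` would be called from whatever hook the CMS runs after an upload, so the client never has to trigger indexing by hand.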

To avoid having too much noise in your records, I'd recommend that you then split the text and create one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results, so that each PDF shows up only once.
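The paragraph-per-record idea could look roughly like the sketch below. It only builds the record and settings dictionaries and does not call any Algolia client. The attribute names (`pdf_id`, `pdf_url`, `paragraph`, `content`) and the helper functions are assumptions made for illustration; `attributeForDistinct` and `distinct` are the Algolia index settings that drive deduplication.

```python
from __future__ import annotations

import re


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text on blank lines and collapse whitespace inside each paragraph."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    normalised = text.replace("\r\n", "\n").replace("\f", "\n\n")
    chunks = re.split(r"\n\s*\n", normalised)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def build_records(pdf_id: str, pdf_url: str, text: str) -> list[dict]:
    """Create one record per paragraph, each carrying the link back to its PDF."""
    return [
        {
            "objectID": f"{pdf_id}-{index}",  # unique per paragraph
            "pdf_id": pdf_id,                 # shared by all paragraphs of one PDF
            "pdf_url": pdf_url,               # stored so the front end can link to the file
            "paragraph": index,
            "content": paragraph,
        }
        for index, paragraph in enumerate(split_paragraphs(text))
    ]


# Index settings: search the paragraph text, return at most one hit per PDF.
INDEX_SETTINGS = {
    "searchableAttributes": ["content"],
    "attributeForDistinct": "pdf_id",
    "distinct": True,
}
```

Because every record stores `pdf_url`, no extra "magic" is needed to link a hit back to its file: the URL comes back with each search result.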

You should already have the links to your files somewhere; just store them in your records. Then, in your front end, you can easily create links to them using, for instance, autocomplete.js or instantsearch.js.
