Searching (extracting text) PDF files with Algolia


Question

This is just a speculative idea for a client who has a lot of PDF files.

Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?

The way I envisage the system working would be:

  • Client uploads PDF via CMS
  • CMS calls some service / program to extract the text
  • Algolia indexes the extracted text, which is somehow linked to the original PDF

It would need to be an automated system, as the client shouldn't have to tell it to index. It would be built in PHP, probably Laravel running on Ubuntu.

What software / service could do the text extraction from the PDFs, and is any magic needed to 'link' the extracted text with the PDF file?

I'm also happy to have suggestions on other search services which may handle this.

Recommended answer

Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
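As a rough illustration of the pdftotext route, here is a minimal sketch that shells out to the tool and returns the extracted text. It is written in Python for brevity; the same call could be made from PHP. The function names `build_pdftotext_command` and `extract_text` are hypothetical, and the sketch assumes pdftotext is installed (on Ubuntu it ships in the poppler-utils package).

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the argument list for pdftotext.

    "-enc UTF-8" asks for UTF-8 output, and the trailing "-" tells
    pdftotext to write the text to stdout instead of a .txt file.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the text of a PDF as a string (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError(
            "pdftotext not found; on Ubuntu it is part of poppler-utils"
        )
    # check=True raises CalledProcessError if pdftotext exits with an error.
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Scanned PDFs that contain only images will come back empty from this step, since pdftotext reads the embedded text layer and does not perform OCR.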

To avoid having too much noise in your records, I'd recommend that you then split the text and create one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results, so that each PDF appears only once even when several of its paragraphs match.
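One way the splitting could look is sketched below, assuming the extracted text separates paragraphs with blank lines and pages with form feeds (which is how pdftotext output typically looks). The helper names, the `file_id` attribute, and the `min_chars` threshold are illustrative assumptions; `attributeForDistinct` and `distinct` are the Algolia index settings behind the distinct feature. Uploading the records is left to whichever Algolia API client you use.

```python
import re


def split_paragraphs(text, min_chars=20):
    """Split extracted text into cleaned paragraphs.

    Fragments shorter than min_chars (page numbers, stray headers)
    are dropped; the threshold is an arbitrary assumption.
    """
    text = text.replace("\f", "\n\n")  # treat page breaks as paragraph breaks
    paragraphs = []
    for chunk in re.split(r"\n\s*\n", text):
        cleaned = " ".join(chunk.split())  # collapse line wraps and extra spaces
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


def paragraphs_to_records(file_id, paragraphs):
    """Create one record per paragraph, all sharing the same file_id."""
    return [
        {
            "objectID": f"{file_id}-{position}",
            "file_id": file_id,  # hypothetical attribute used for distinct
            "position": position,
            "content": paragraph,
        }
        for position, paragraph in enumerate(paragraphs)
    ]


# Index settings that collapse all paragraphs of one PDF into a single hit.
INDEX_SETTINGS = {
    "searchableAttributes": ["content"],
    "attributeForDistinct": "file_id",
    "distinct": True,
}
```

With these settings, a query matching five paragraphs of the same file would return that file once, represented by its best-matching paragraph.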

You should already have the links to your files somewhere; just store them in your records. Then, in your front end, you can easily create links to them using, for instance, autocomplete.js or instantsearch.js.
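No special 'linking' mechanism is needed beyond keeping the URL on each record. The sketch below shows the idea under assumed names: `attach_pdf_url` and `hit_to_link` are hypothetical helpers, and `url` and `title` are attribute names chosen for illustration. In a real front end, the rendering step would live in an autocomplete.js or instantsearch.js template rather than in Python.

```python
from html import escape


def attach_pdf_url(records, pdf_url):
    """Return copies of the records with the PDF's URL stored on each one."""
    return [{**record, "url": pdf_url} for record in records]


def hit_to_link(hit):
    """Render one search hit as an HTML link back to the source PDF."""
    href = escape(hit["url"], quote=True)
    label = escape(hit["title"])
    return f'<a href="{href}">{label}</a>'
```

Because every paragraph record carries the same `url`, whichever paragraph matches the query leads the user back to the original file.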
