Searching attachments from a Rails app (Word, PDF, Excel etc)

Question

My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is a search engine that will index roughly 2,000 documents, which are a mixture of PDF, Word, Excel and HTML.

I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:

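So I have two choices:

  1. Choose a different search tool
  2. Try to extract plain-text versions of the attachments into the database so that thinking-sphinx can read them

Which approach do you recommend?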

If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!

If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?

Thanks everyone. Really appreciate your help.

Accepted answer

Just to update this. The approach I've decided to go with is:

Try to extract plain-text versions of the attachments into the database so that thinking-sphinx can read them

Specifically, I'll be doing the following:

  • Using thinking-sphinx
  • Using the subexec gem to call ...
  • ... Tika from the command line

It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file] but I'll post my experiences if it turns out to be more complicated!
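For illustration, here is a minimal sketch of what that call could look like from Ruby. It is not the answerer's code: it uses Ruby's standard-library Open3 in place of the subexec gem so that it runs without extra dependencies, and the names TIKA_JAR, tika_command and extract_text are hypothetical. It also assumes the jar sits in the working directory and that java is on the PATH.

```ruby
require "open3"

# Assumption: the Tika jar lives in the current working directory.
TIKA_JAR = "tika-app-0.10.jar"

# Build the argument list for Tika's plain-text mode (-t), mirroring
# the command quoted in the answer. Passing an array rather than one
# string avoids shell quoting problems with spaces in file names.
def tika_command(file_path, jar: TIKA_JAR)
  ["java", "-jar", jar, "-t", file_path.to_s]
end

# Run Tika and return the extracted plain text.
# The runner is injectable (hypothetical design choice) so the method
# can be exercised without Java installed; by default it shells out
# via Open3.capture3, which returns [stdout, stderr, status].
def extract_text(file_path, jar: TIKA_JAR, runner: Open3.method(:capture3))
  stdout, stderr, status = runner.call(*tika_command(file_path, jar: jar))
  raise "Tika failed for #{file_path}: #{stderr}" unless status.success?
  stdout
end
```

The returned string could then be saved to a text column on the model that holds the attachment, and that column listed in the thinking-sphinx index definition. Open3.capture3 has no timeout of its own, so a long-running extraction would need one added separately.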
