Searching attachments from a Rails app (Word, PDF, Excel etc.)
Question
My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.
I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:
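- Texticle requires PostgreSQL, and I am using MySQL.
- thinking-sphinx does not index files on the file system.
- Even if I saved the attachments to the database, thinking-sphinx still would not work, because it needs plain text (according to http://groups.google.com/group/thinking-sphinx/browse_thread/thread/69cdc1c8e1c096ff)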
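So I have two options:
- Choose a different search tool
- Try to extract a plain-text version of the attachments into the database for thinking-sphinx to read
Which approach would you recommend?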
If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!
If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?
Thanks everyone. Really appreciate your help.
Recommended answer
Just to update this. The approach I've decided to go with is:
Try to extract a plain-text version of the attachments into the database for thinking-sphinx to read
Specifically, I'll be doing the following:
- Using thinking-sphinx
- Using the subexec gem to call ...
- ... Tika from the command line
It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file]
but I'll post my experiences if it turns out to be more complicated!
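As a rough illustration of the extraction step, here is a minimal Ruby sketch that shells out to Tika with the command quoted above. It makes a few assumptions: it uses Ruby's standard-library Open3 rather than the subexec gem mentioned in the answer, the jar is assumed to sit in the working directory, and the names TIKA_JAR, tika_command and extract_text are hypothetical helpers, not part of Tika or thinking-sphinx.

```ruby
require "open3"

# Hypothetical location of the Tika jar; adjust to wherever it actually lives.
TIKA_JAR = "tika-app-0.10.jar"

# Build the argument list for `java -jar tika-app-0.10.jar -t [file]`.
# Returning an array (not a string) means no shell is involved, so file
# names containing spaces or quotes are passed through untouched.
def tika_command(path, jar: TIKA_JAR)
  ["java", "-jar", jar, "-t", path.to_s]
end

# Run Tika and return the extracted plain text.
# Raises if the Java process exits with a non-zero status.
def extract_text(path, jar: TIKA_JAR)
  stdout, stderr, status = Open3.capture3(*tika_command(path, jar: jar))
  raise "Tika failed for #{path}: #{stderr.strip}" unless status.success?
  stdout
end
```

The returned string could then be saved to a text column on the attachment's model (for example from a callback run when the file is uploaded), which gives thinking-sphinx the plain text it needs to index.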