Keyword Search for Document Management System


Problem description






Hi,

I am trying to develop a web-based document management system for internal use and have run into an issue that I have not been able to find an answer to.

I was hoping to provide users with a search box where they can enter a keyword to search through the contents of the documents. If a match is found, the application will display the matching files. The documents will generally be PDFs, but on rare occasions the directory may also contain Word documents.

I would appreciate any help in identifying the options available for achieving this, as well as any resources you can point me to that would help with the development.

Thank you for taking the time to read my request.
Mo

Recommended answer

Since this is a text search, you may want to convert the Word documents to plain (Unicode) text for search purposes.
It would be best if your documents were .docx rather than the old .doc. This newer format is based on Open XML: http://en.wikipedia.org/wiki/Office_Open_XML.
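Once each document has been reduced to plain Unicode text, the search itself can be very small. The following is only a minimal sketch: the function names and the `texts` mapping (file name to extracted text) are hypothetical, and it does a plain substring match after case folding and Unicode normalization rather than any real indexing.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold case and Unicode compatibility forms so matching is lenient."""
    return unicodedata.normalize("NFKC", text).casefold()


def search_documents(texts: dict[str, str], keyword: str) -> list[str]:
    """Return the sorted names of documents whose text contains the keyword.

    `texts` maps a file name to its already-extracted plain text
    (a hypothetical structure chosen for this illustration).
    """
    needle = normalize(keyword)
    if not needle:
        return []
    return sorted(name for name, body in texts.items() if needle in normalize(body))
```

For a large document store a dedicated full-text index would probably scale better; this only illustrates why plain text is a convenient common format for searching.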

So the simplest fallback solution, just for search, could be this: unpack the .docx document using the ZIP algorithm (this is how such documents are packaged) and use the extracted XML for the text search (you can also remove all the XML tags).
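The unpack-and-strip approach can be sketched with the Python standard library alone. This is an illustration under some assumptions: `docx_to_text` is a hypothetical helper name, the main body is read from the `word/document.xml` part and assumed to be UTF-8 encoded, and headers, footers, and footnotes (which live in other parts of the package) are ignored.

```python
import re
import zipfile
from html import unescape


def docx_to_text(path: str) -> str:
    """Extract rough plain text from a .docx file for search purposes."""
    with zipfile.ZipFile(path) as archive:
        # word/document.xml holds the main body of the document.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Turn the end of each paragraph into a line break before stripping tags.
    xml = xml.replace("</w:p>", "\n")
    # Remove every XML tag, leaving only the character data.
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode XML entities such as &amp; back into characters.
    return unescape(text)
```

The result is good enough for keyword matching but loses all formatting, which is acceptable here because the text is only used to decide which files to show.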

For finer-grained use, you can use the Open XML SDK, which you can obtain from Microsoft free of charge. Please see my past answer, the answers it references, and other materials: How to add microsoft excel 15.0 object library from Add Reference in MS Visual Studio 2010.

By the way, see also Microsoft's warnings against using Office interop in server settings:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2,
http://support.microsoft.com/kb/257757/en-us.

What should you do if you have to use old .doc files? It's better to avoid them by all means (why not convert them before storing them on the site?). It is still possible to work with them, but it is much harder. The only source I know of is the API that comes with the open-source LibreOffice. Please see my answer referenced above, specifically the first link in it.
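One way to convert old files before storing them is LibreOffice's command-line converter, which is a different route from the API mentioned above. The sketch below assumes the `soffice` executable is installed and on the PATH; the function names are hypothetical, and error handling is left to `check=True`.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path: str, out_dir: str) -> list[str]:
    """Build a LibreOffice command line that converts one file to .docx."""
    return [
        "soffice",                # assumption: LibreOffice is on the PATH
        "--headless",             # run without a user interface
        "--convert-to", "docx",   # target format
        "--outdir", out_dir,      # where the converted file is written
        doc_path,
    ]


def convert_doc_to_docx(doc_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path where the .docx is expected."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

Running the conversion once at upload time means the search code only ever has to deal with the newer format.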

You can also try to find something else:
http://bit.ly/15DSm5l,
http://bit.ly/15MYwki.

However, I would avoid Office documents altogether. Even though Open XML is presently a public standard, the Office documents and applications are still proprietary and are not part of the W3C standards. Isn't it possible to rework the content into HTML or some XML-based documentation?

—SA

