Index Word/PDF Documents From File System To SQL Server


Question

I'm trying to come up with a simple solution to a problem I have because all of those I have found so far just seem too complicated!

The situation is that we use a proprietary application for managing most aspects of our business. It has an SQL Server 2005 backend database, which is quite large. The application also allows the attaching of Word and PDF documents to records, which we use extensively, and these are stored in the file system on the server, with the filenames referenced in the database. Unfortunately the search facilities in the application are poor, so I'm trying to build my own version.

So far I've got a neat ASP.NET page with a search box which will allow users to enter words to search for, as well as filter their results on other fields, such as department, date, etc. The Stored Procedure I've written in the database looks for the words they're searching for in several different fields in the database. What I'm really aiming for is Google-style 'one search to rule them all' effect, where the user doesn't have to specify where they expect to find the word they're looking for, they will just get hits anywhere that it appears in the database. And this is working.

What I want to add now is the ability for the search to include the text of the documents which are 'attached' to records. They are all either .doc or .pdf files but if I couldn't search the .pdf files it wouldn't be the end of the world.

In my ideal world what I'd do is find some software which would index the folder containing the documents (currently there are around 100,000 of them, averaging about 100k) and populate a table in my existing database with this index so that I could then just include that table in my search. I'd love it to just contain a record for each unique word it indexed and a join table referencing documents in the file system containing that word.
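To make the table layout described above concrete, here is a minimal sketch of a word table plus a join table. It is an illustration only: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for SQL Server, the table names (`document`, `word`, `word_document`) and function names are made up for this example, and it assumes the text has already been extracted from the .doc/.pdf files by some other tool.

```python
import re
import sqlite3


def build_index(conn, documents):
    """Index documents given as a mapping of file path -> extracted plain text."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS document (
            id INTEGER PRIMARY KEY, path TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS word (
            id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS word_document (
            word_id INTEGER REFERENCES word(id),
            document_id INTEGER REFERENCES document(id),
            PRIMARY KEY (word_id, document_id));
    """)
    for path, text in documents.items():
        conn.execute("INSERT OR IGNORE INTO document (path) VALUES (?)", (path,))
        doc_id = conn.execute(
            "SELECT id FROM document WHERE path = ?", (path,)).fetchone()[0]
        # One row per unique word, and one join row per (word, document) pair.
        for term in set(re.findall(r"\w+", text.lower())):
            conn.execute("INSERT OR IGNORE INTO word (term) VALUES (?)", (term,))
            word_id = conn.execute(
                "SELECT id FROM word WHERE term = ?", (term,)).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO word_document VALUES (?, ?)",
                (word_id, doc_id))
    conn.commit()


def find_documents(conn, term):
    """Return the paths of all documents containing the given word."""
    rows = conn.execute(
        """SELECT d.path
           FROM document d
           JOIN word_document wd ON wd.document_id = d.id
           JOIN word w ON w.id = wd.word_id
           WHERE w.term = ?
           ORDER BY d.path""",
        (term.lower(),))
    return [row[0] for row in rows]
```

Calling `build_index(sqlite3.connect(":memory:"), {...})` and then `find_documents(conn, "acme")` would return the paths whose text contains that word; the same join could in principle be folded into an existing search query.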

Given that this seems fanciful and there isn't any software that will do this, or anything close to it, as far as I can see, what solution would you recommend? The server already has dtSearch running on it, indexing the very files I'm interested in. However, whilst I could wade through the documentation trying to figure out how to implement a search of this index through my own webpage (which I've started to do, and found heavy going), that would have to be a separate search to the one of the SQL database. I couldn't return results from the file index and the database in a unified way.

So, starting from the ultimate wish of having the indexed words stored in the database, with a view to implementing full-text searching on that, what would anyone suggest?

Solution

SQL Server has full-text search (http://msdn.microsoft.com/en-us/library/ms142571.aspx); this supports both PDF and Word files (though with some wrinkles - installation can be a bit tricky). The link is to the SQL Server 2008 documentation, but the feature has been present since SQL Server 2000.

So, put very simply: your solution would require you to load the documents into SQL Server, and amend your stored proc to query them using the built-in full-text search features.
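As a rough sketch of the "amend your stored proc" step: SQL Server's `CONTAINS` predicate takes a search condition string, so the page's search box input has to be turned into one. The helper below is a hypothetical example, not part of any library; the table and column names in the comment (`dbo.AttachedDocument`, `Content`) and the `@SearchCondition` parameter are assumptions made for illustration.

```python
import re


def build_contains_condition(user_input):
    """Turn free-form user input into an AND-ed CONTAINS search condition.

    Only word characters are kept, so quotes or operators typed by the
    user cannot change the structure of the condition. Returns an empty
    string when there is nothing to search for; the caller should then
    skip the full-text predicate, since an empty condition is an error.
    """
    terms = re.findall(r"\w+", user_input)
    return " AND ".join(f'"{term}"' for term in terms)


# The resulting string would be passed as a parameter to a query such as
# (hypothetical table and column names):
#
#   SELECT RecordId, FileName
#   FROM dbo.AttachedDocument
#   WHERE CONTAINS(Content, @SearchCondition)
```

Requiring every word (AND) is just one choice; joining with OR, or using `FREETEXT` instead, would give looser matching.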

Keeping the file system and database versions of the document synchronized could be a challenge, but other than that, I think the solution should be fairly straightforward.
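One possible way to approach the synchronization, sketched under assumptions: periodically scan the document folder, compare each file's size and modification time against what was recorded on the previous run, and reload only what changed. The function names are made up for this example, and how the previous state is stored (for instance in a table next to the documents) is left open.

```python
import os


def scan_folder(root):
    """Return {relative path: (size, mtime)} for every .doc/.pdf under root."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".doc", ".pdf")):
                full = os.path.join(dirpath, name)
                info = os.stat(full)
                state[os.path.relpath(full, root)] = (info.st_size, info.st_mtime)
    return state


def diff_state(previous, current):
    """Compare two scans and return (to_load, to_delete) as sorted path lists.

    to_load holds new files and files whose size or mtime changed;
    to_delete holds files that were present before but are now gone.
    """
    to_load = sorted(p for p, sig in current.items() if previous.get(p) != sig)
    to_delete = sorted(p for p in previous if p not in current)
    return to_load, to_delete
```

A scheduled job could run `scan_folder`, call `diff_state` against the saved result of the last run, and then insert, update, or delete the corresponding rows in the database.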
