Sitecore text search in PDF or Word documents

Views: 263 Published: 2016/6/9 19:17:31 Tags: c# asp.net sitecore sitecore6 sitecore-media-library


Question

I want to find out whether it's possible to configure Sitecore's Lucene search engine to index PDF or Word documents. I've looked at this document on the Sitecore support site (http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf), but it mentions creating a file crawler class, which suggests to me that this can only be achieved by writing custom code. If I do need to write custom code, would I also need some API to extract the text content from PDF documents?

Recommended answer

I've recently had to do something similar on one of my projects. Have a look at "How to index Word 2003, 2007 and 2010 documents using Lucene.NET" (http://stackoverflow.com/questions/4014337/how-to-index-word-2003-2007-and-2010-documents-using-lucene-net).

I ended up creating a custom indexer which handled MS Office documents (XP, 2003, 2007 and 2010 formats) and PDF documents:

For indexing XP-2003 MS Office documents, you can use the IFilters built into the OS (assuming you are using Windows Server 2003 or newer).
For indexing 2007-2010 MS Office documents, you will need to install the Microsoft Office 2010 Filter Packs.
For indexing PDF documents, I strongly recommend Foxit PDF IFilter. It is not free, but it does a much better job than the Adobe PDF IFilter.
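As a rough illustration of how a custom indexer might decide which of those filters a media file depends on, here is a minimal C# sketch. It is not the answerer's code: `FilterRouter`, `Resolve`, and the extension lists are hypothetical names and assumptions chosen for the example, and the actual text extraction through the installed IFilter is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: maps a file extension to the IFilter source named in
// the answer. The extension lists are illustrative assumptions, not an
// exhaustive or authoritative mapping.
public static class FilterRouter
{
    private static readonly Dictionary<string, string> FilterByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".doc",  "IFilter built into the OS" },
            { ".xls",  "IFilter built into the OS" },
            { ".ppt",  "IFilter built into the OS" },
            { ".docx", "Microsoft Office 2010 Filter Packs" },
            { ".xlsx", "Microsoft Office 2010 Filter Packs" },
            { ".pptx", "Microsoft Office 2010 Filter Packs" },
            { ".pdf",  "Foxit PDF IFilter" }
        };

    // Returns the filter expected to handle the file, or null when the
    // custom indexer should skip the file.
    public static string Resolve(string fileName)
    {
        string extension = Path.GetExtension(fileName);
        string filter;
        return FilterByExtension.TryGetValue(extension, out filter) ? filter : null;
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (string name in new[] { "report.pdf", "notes.DOCX", "logo.png" })
        {
            Console.WriteLine("{0} -> {1}", name, FilterRouter.Resolve(name) ?? "(not indexed)");
        }
    }
}
```

In a real indexer, the resolved entry would only tell you which filter package has to be installed on the server; the text itself would still have to be pulled out through that filter and added to the Lucene document by your own code.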

Note: Don't waste your time with the Adobe PDF IFilter: it fails to read valid PDF files and is a lot slower. The Foxit PDF IFilter is designed to take advantage of multi-core CPUs and performs much better on large documents.
