Sitecore text search in PDF or Word documents


Question

Is it possible to configure Sitecore's Lucene search engine to index PDF or Word documents? I've looked at this document on the Sitecore support site (http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf), but it mentions creating a file crawler class, which suggests to me that this can only be achieved by writing custom code. If I do need to write custom code, would I also need some API to extract the text content from PDF documents?

Recommended answer

I recently had to do something similar on one of my projects. Have a look at "How to index Word 2003, 2007 and 2010 documents using Lucene.NET" (http://stackoverflow.com/questions/4014337/how-to-index-word-2003-2007-and-2010-documents-using-lucene-net).

I ended up creating a custom indexer that handles MS Office documents (XP, 2003, 2007 and 2010 formats) and PDF documents:


  • For indexing XP-2003 MS Office documents you can use IFilters built into the OS (assuming you are using Windows Server 2003 or newer)
  • For indexing 2007-2010 MS Office documents you will need to install Microsoft Office 2010 Filter Packs
  • For indexing PDF documents I strongly recommend using Foxit PDF IFilter. It is not free, but does a much better job than the Adobe PDF IFilter.
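The list above amounts to a mapping from document format to the IFilter that must be installed on the indexing server. A minimal sketch of that mapping follows, written in Python purely for illustration (an actual Sitecore/Lucene.NET indexer would be written in C# and would call the installed IFilter to extract text, which this sketch does not do). The function names and the grouping of file extensions are assumptions, not part of the original answer; only the filter names come from the list.

```python
from pathlib import PurePath

# Hypothetical lookup table. The extension groupings are assumptions;
# the filter names are taken from the list above.
FILTER_BY_EXTENSION = {
    ".doc": "built-in OS IFilter",
    ".xls": "built-in OS IFilter",
    ".ppt": "built-in OS IFilter",
    ".docx": "Microsoft Office 2010 Filter Packs",
    ".xlsx": "Microsoft Office 2010 Filter Packs",
    ".pptx": "Microsoft Office 2010 Filter Packs",
    ".pdf": "Foxit PDF IFilter",
}


def required_filter(file_name):
    """Return the name of the IFilter needed for this file, or None."""
    extension = PurePath(file_name).suffix.lower()
    return FILTER_BY_EXTENSION.get(extension)


def group_by_filter(file_names):
    """Group file names by required filter; unsupported files go under None."""
    groups = {}
    for name in file_names:
        groups.setdefault(required_filter(name), []).append(name)
    return groups
```

A custom indexer could use a lookup like this to skip files it has no filter for, rather than failing partway through a crawl of the media library.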

Note: Don't waste your time with the Adobe PDF IFilter: it fails to read valid PDF files and is a lot slower. The Foxit PDF IFilter is designed to take advantage of multi-core CPUs and performs much better on large documents.
