Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET


Question

I've heard of Lucene.Net and I've heard of Apache Tika. The question is: how do I index these documents using C# rather than Java? I think the issue is that there is no .NET equivalent of Tika, which extracts the relevant text from these document types.

Update - February 5, 2011

Based on the responses given, it seems that there is currently no native .NET equivalent of Tika. Two projects were mentioned that are each interesting in their own right:

  1. Xapian Project (http://xapian.org/) - An alternative to Lucene written in unmanaged code. The project claims to support "swig" which allows for C# bindings. Within the Xapian Project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
  2. IKVM.NET (http://www.ikvm.net/) - Allows Java to be run from .Net. An example of using IKVM to run Tika can be found here.

Given the above 2 projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner as there are only 2 dependencies.

The interesting part is that there are now several search engines that could probably be used from .NET: Xapian, Lucene.Net, or even Lucene itself (using IKVM).

Update - February 7, 2011

Another answer came in recommending that I check out ifilters. As it turns out, this is what MS uses for Windows Search, so Office ifilters are readily available. There are also some PDF ifilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the code snippet below in a DotLucene.NET archive (no longer an active project):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

namespace IFilter
{
    [Flags]
    public enum IFILTER_INIT : uint
    {
        NONE = 0,
        CANON_PARAGRAPHS = 1,
        HARD_LINE_BREAKS = 2,
        CANON_HYPHENS = 4,
        CANON_SPACES = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY = 64,
        SEARCH_LINKS = 128,
        FILTER_OWNED_VALUE_OK = 512
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW = 1,
        CHUNK_EOS = 2,
        CHUNK_EOP = 3,
        CHUNK_EOC = 4
    }

    [Flags]
    public enum CHUNKSTATE
    {
        CHUNK_TEXT = 0x1,
        CHUNK_VALUE = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        [PreserveSig]
        int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

        [PreserveSig]
        int GetChunk(out STAT_CHUNK pStat);

        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);
        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
    public class CFilter
    {
    }

    public class IFilterConstants
    {
        public const uint PID_STG_DIRECTORY = 0x00000002;
        public const uint PID_STG_CLASSID = 0x00000003;
        public const uint PID_STG_STORAGETYPE = 0x00000004;
        public const uint PID_STG_VOLUME_ID = 0x00000005;
        public const uint PID_STG_PARENT_WORKID = 0x00000006;
        public const uint PID_STG_SECONDARYSTORE = 0x00000007;
        public const uint PID_STG_FILEINDEX = 0x00000008;
        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
        public const uint PID_STG_NAME = 0x0000000a;
        public const uint PID_STG_PATH = 0x0000000b;
        public const uint PID_STG_SIZE = 0x0000000c;
        public const uint PID_STG_ATTRIBUTES = 0x0000000d;
        public const uint PID_STG_WRITETIME = 0x0000000e;
        public const uint PID_STG_CREATETIME = 0x0000000f;
        public const uint PID_STG_ACCESSTIME = 0x00000010;
        public const uint PID_STG_CHANGETIME = 0x00000011;
        public const uint PID_STG_CONTENTS = 0x00000013;
        public const uint PID_STG_SHORTNAME = 0x00000014;
        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
        public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
        public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
        public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
    }

    /// <summary>IFilter return codes</summary>
    public enum IFilterReturnCodes : uint
    {
        /// <summary>Success</summary>
        S_OK = 0,
        /// <summary>The function was denied access to the filter file.</summary>
        E_ACCESSDENIED = 0x80070005,
        /// <summary>The function encountered an invalid handle, probably due to a low-memory situation.</summary>
        E_HANDLE = 0x80070006,
        /// <summary>The function received an invalid parameter.</summary>
        E_INVALIDARG = 0x80070057,
        /// <summary>Out of memory</summary>
        E_OUTOFMEMORY = 0x8007000E,
        /// <summary>Not implemented</summary>
        E_NOTIMPL = 0x80004001,
        /// <summary>Unspecified error (the Win32 HRESULT for E_FAIL is 0x80004005)</summary>
        E_FAIL = 0x80004005,
        /// <summary>File not filtered due to password protection</summary>
        FILTER_E_PASSWORD = 0x8004170B,
        /// <summary>The document format is not recognised by the filter</summary>
        FILTER_E_UNKNOWNFORMAT = 0x8004170C,
        /// <summary>No text in current chunk</summary>
        FILTER_E_NO_TEXT = 0x80041705,
        /// <summary>No more chunks of text available in object</summary>
        FILTER_E_END_OF_CHUNKS = 0x80041700,
        /// <summary>No more text available in chunk</summary>
        FILTER_E_NO_MORE_TEXT = 0x80041701,
        /// <summary>No more property values available in chunk</summary>
        FILTER_E_NO_MORE_VALUES = 0x80041702,
        /// <summary>Unable to access object</summary>
        FILTER_E_ACCESS = 0x80041703,
        /// <summary>Moniker doesn't cover entire region</summary>
        FILTER_W_MONIKER_CLIPPED = 0x00041704,
        /// <summary>Unable to bind IFilter for embedded object</summary>
        FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
        /// <summary>Unable to bind IFilter for linked object</summary>
        FILTER_E_LINK_UNAVAILABLE = 0x80041708,
        /// <summary>This is the last text in the current chunk</summary>
        FILTER_S_LAST_TEXT = 0x00041709,
        /// <summary>This is the last value in the current chunk</summary>
        FILTER_S_LAST_VALUES = 0x0004170A
    }

    /// <summary>
    /// Convenience class which provides static methods to extract text from files using installed IFilters
    /// </summary>
    public class DefaultParser
    {
        public DefaultParser()
        {
        }

        [DllImport("query.dll", CharSet = CharSet.Unicode)]
        private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

        private static IFilter loadIFilter(string filename)
        {
            object outer = null;
            IFilter filter = null;

            // Try to load the corresponding IFilter
            int resultLoad = LoadIFilter(filename,  outer, ref filter);
            if (resultLoad != (int) IFilterReturnCodes.S_OK)
            {
                return null;
            }
            return filter;
        }

        public static bool IsParseable(string filename)
        {
            return loadIFilter(filename) != null;
        }

        public static string Extract(string path)
        {
            StringBuilder sb = new StringBuilder();
            IFilter filter = null;

            try
            {
                filter = loadIFilter(path);

                if (filter == null)
                    return String.Empty;

                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();

                IFILTER_INIT iflags =
                    IFILTER_INIT.CANON_HYPHENS |
                    IFILTER_INIT.CANON_PARAGRAPHS |
                    IFILTER_INIT.CANON_SPACES |
                    IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                    IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                    IFILTER_INIT.HARD_LINE_BREAKS |
                    IFILTER_INIT.SEARCH_LINKS |
                    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                    throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                {
                    if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                    {
                        IFilterReturnCodes scode = 0;
                        while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                        {
                            uint pcwcBuffer = 65536;
                            System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)pcwcBuffer);

                            scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                            if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                            {
                                if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens !
                                    pcwcBuffer = (uint)sbBuffer.Length;

                                sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                sb.Append(" "); // originally "\r\n"
                            }

                        }
                    }

                }
            }
            finally
            {
                if (filter != null) {
                    Marshal.ReleaseComObject (filter);
                    System.GC.Collect();
                    System.GC.WaitForPendingFinalizers();
                }
            }

            return sb.ToString();
        }
    }
}

At the moment, this seems like the best way to extract text from documents using the .NET platform on a Windows server. Thanks everybody for your help.
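
The extracted text still has to be handed to Lucene.NET before it can be searched. Below is a minimal sketch of that step, assuming the Lucene.Net 3.0.3 API; the class name ExtractedTextIndexer, the method IndexFiles, and the field names "path" and "content" are hypothetical and not part of the original snippet. The extractor is passed in as a delegate, so any text extraction function can be used.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using LuceneDirectory = Lucene.Net.Store.Directory;
using LuceneVersion = Lucene.Net.Util.Version;

public static class ExtractedTextIndexer
{
    // Hypothetical helper (assumes Lucene.Net 3.0.3).
    // extractText maps a file path to its plain text, for example
    // IFilter.DefaultParser.Extract from the listing above.
    // Returns the number of documents added to the index.
    public static int IndexFiles(
        LuceneDirectory indexDir,
        IEnumerable<string> paths,
        Func<string, string> extractText)
    {
        int indexed = 0;
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);

        // create: true overwrites any existing index in indexDir.
        using (var writer = new IndexWriter(
            indexDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string path in paths)
            {
                string text = extractText(path);
                if (string.IsNullOrEmpty(text))
                    continue; // nothing extracted, skip the file

                var doc = new Document();
                // Store the path so search hits can be mapped back to files.
                doc.Add(new Field("path", path,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Tokenize the body for full-text search, but do not store it.
                doc.Add(new Field("content", text,
                    Field.Store.NO, Field.Index.ANALYZED));

                writer.AddDocument(doc);
                indexed++;
            }
            writer.Commit();
        }
        return indexed;
    }
}
```

With an on-disk index this might be called as ExtractedTextIndexer.IndexFiles(FSDirectory.Open(new DirectoryInfo(indexPath)), files, IFilter.DefaultParser.Extract), where indexPath and files are placeholders for your own index folder and file list.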

Update - March 8, 2011

While I still think that ifilters are a good way to go, if you are looking to index documents using Lucene from .NET, a very good alternative is Solr. When I first started researching this topic, I had never heard of Solr. For those of you who have not either, Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine and communicate with it via HTTP from your .NET application. Solr is truly written like a service and can do everything Lucene can do (including using Tika to extract text from .PDF, .XLS, .DOC, .PPT, etc.), and then some. Solr also seems to have a very active community, which is one thing I am not too sure of with regard to Lucene.NET.
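
As a rough illustration of talking to Solr over HTTP from .NET, the sketch below posts a file to Solr's extracting request handler, which runs Tika on the server. It assumes the handler is enabled at the conventional /update/extract path; the class SolrExtractClient, its method names, and the choice of the file path as the document id are hypothetical.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class SolrExtractClient
{
    // Builds the request URI for the extracting handler.
    // coreUrl is an assumption, e.g. "http://localhost:8983/solr/docs".
    // literal.id sets the document's id field; commit=true makes it
    // searchable immediately (convenient for a demo, slow for bulk loads).
    public static Uri BuildExtractUri(string coreUrl, string documentId)
    {
        return new Uri(
            coreUrl.TrimEnd('/') +
            "/update/extract?literal.id=" + Uri.EscapeDataString(documentId) +
            "&commit=true");
    }

    // Sends the raw file bytes as the request body and reports whether
    // Solr answered with a success status code.
    public static async Task<bool> PostFileAsync(
        HttpClient http, string coreUrl, string path)
    {
        using (var body = new ByteArrayContent(File.ReadAllBytes(path)))
        {
            // Generic content type; the server-side Tika is left to detect
            // the actual document format.
            body.Headers.ContentType =
                new MediaTypeHeaderValue("application/octet-stream");

            Uri uri = BuildExtractUri(coreUrl, path);
            using (HttpResponseMessage response = await http.PostAsync(uri, body))
            {
                return response.IsSuccessStatusCode;
            }
        }
    }
}
```

Nothing here needs Lucene.NET, Tika, or ifilters on the .NET side; the application only needs an HTTP client.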

Recommended answer

You can also check out ifilters. There are a number of resources if you search for "asp.net ifilters".

Of course, there is added hassle if you are distributing this to client systems, because you will either need to include the ifilters with your distribution and install those with your app on their machine, or they will lack the ability to extract text from any files they don't have ifilters for.
