索引.PDF,.XLS,.DOC,.PPT使用Lucene.NET [英] Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET

查看:241
本文介绍了索引.PDF,.XLS,.DOC,.PPT使用Lucene.NET的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我听说 Lucene.Net 的,我听说过的Apache提卡的。现在的问题是 - 我怎么指标使用C#和Java这些文件?我认为这个问题是不存在的.Net相当于提卡的从这些文档类型中提取相关的文字。

I've heard of Lucene.Net and I've heard of Apache Tika. The question is - how do I index these documents using C# vs Java? I think the issue is that there is no .Net equivalent of Tika which extracts relevant text from these document types.

更新 - 2011年2月5日

根据给定的反应,似乎目前还不是的原始的.Net相当于提卡的。提到2有趣的项目是每一个有趣的在自己的权利:

Based on given responses, it seems that the is not currently a native .Net equivalent of Tika. 2 interesting projects were mentioned that are each interesting in their own right:


  1. 的Xapian项目 http://xapian.org/ ) - Lucene的另一种写在非托管$ C $℃。该项目声称支持痛饮,允许使用C#绑定。内Xapian的项目有一个名为欧米茄一个不折不扣的即装即用的搜索引擎。欧米茄采用多种开源组件提取各种文件类型的文本。

  2. IKVM.NET http://www.ikvm.net/ ) - 允许Java来从净运行。使用IKVM运行提卡的例子可以发现<一href=\"http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm\">here.

  1. Xapian Project (http://xapian.org/) - An alternative to Lucene written in unmanaged code. The project claims to support "swig" which allows for C# bindings. Within the Xapian Project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
  2. IKVM.NET (http://www.ikvm.net/) - Allows Java to be run from .Net. An example of using IKVM to run Tika can be found here.

鉴于上述2个项目,我看到一对夫妇的选择。要提取文本,我既可以一)使用欧米茄正在使用或b)使用IKVM运行蒂卡相同的组件。对我来说,选项B)似乎更清洁,因为只有2依赖关系。

Given the above 2 projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner as there are only 2 dependencies.

有趣的是,现在有几个搜索引擎,也许可以从.net使用。有Xapian的,Lucene.Net甚至Lucene的(使用IKVM)。

The interesting part is that now there are several search engines that could probably be used from .Net. There is Xapian, Lucene.Net or even Lucene (using IKVM).

更新 - 2011年2月7日

另一个答案进来了,建议我检查出的IFilter。事实证明,这是MS使用的Windows搜索,以便办公室的IFilter都一应俱全。此外,还有一些PDF IFilter的在那里。缺点是,他们在非托管code实现的,所以COM互操作需要使用它们。我发现一个DotLucene.NET档案(不再活动项目)以下code SNIPPIT:

Another answer came in recommending that I check out ifilters. As it turns out, this is what MS uses for windows search so Office ifilters are readily available. Also, there are some PDF ifilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the below code snippit on a DotLucene.NET archive (no longer an active project):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

namespace IFilter
{
    [Flags]
    public enum IFILTER_INIT : uint
    {
        NONE = 0,
        CANON_PARAGRAPHS = 1,
        HARD_LINE_BREAKS = 2,
        CANON_HYPHENS = 4,
        CANON_SPACES = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY = 64,
        SEARCH_LINKS = 128,
        FILTER_OWNED_VALUE_OK = 512
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW = 1,
        CHUNK_EOS = 2,
        CHUNK_EOP = 3,
        CHUNK_EOC = 4
    }

    [Flags]
    public enum CHUNKSTATE
    {
        CHUNK_TEXT = 0x1,
        CHUNK_VALUE = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        [PreserveSig]
        int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

        [PreserveSig]
        int GetChunk(out STAT_CHUNK pStat);

        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);
        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
    public class CFilter
    {
    }

    public class IFilterConstants
    {
        public const uint PID_STG_DIRECTORY = 0x00000002;
        public const uint PID_STG_CLASSID = 0x00000003;
        public const uint PID_STG_STORAGETYPE = 0x00000004;
        public const uint PID_STG_VOLUME_ID = 0x00000005;
        public const uint PID_STG_PARENT_WORKID = 0x00000006;
        public const uint PID_STG_SECONDARYSTORE = 0x00000007;
        public const uint PID_STG_FILEINDEX = 0x00000008;
        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
        public const uint PID_STG_NAME = 0x0000000a;
        public const uint PID_STG_PATH = 0x0000000b;
        public const uint PID_STG_SIZE = 0x0000000c;
        public const uint PID_STG_ATTRIBUTES = 0x0000000d;
        public const uint PID_STG_WRITETIME = 0x0000000e;
        public const uint PID_STG_CREATETIME = 0x0000000f;
        public const uint PID_STG_ACCESSTIME = 0x00000010;
        public const uint PID_STG_CHANGETIME = 0x00000011;
        public const uint PID_STG_CONTENTS = 0x00000013;
        public const uint PID_STG_SHORTNAME = 0x00000014;
        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
        public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
        public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
        public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
    }

    /// 
    /// IFilter return codes
    /// 
    public enum IFilterReturnCodes : uint
    {
        /// 
        /// Success
        /// 
        S_OK = 0,
        /// 
        /// The function was denied access to the filter file. 
        /// 
        E_ACCESSDENIED = 0x80070005,
        /// 
        /// The function encountered an invalid handle, probably due to a low-memory situation. 
        /// 
        E_HANDLE = 0x80070006,
        /// 
        /// The function received an invalid parameter.
        /// 
        E_INVALIDARG = 0x80070057,
        /// 
        /// Out of memory
        /// 
        E_OUTOFMEMORY = 0x8007000E,
        /// 
        /// Not implemented
        /// 
        E_NOTIMPL = 0x80004001,
        /// 
        /// Unknown error
        /// 
        E_FAIL = 0x80000008,
        /// 
        /// File not filtered due to password protection
        /// 
        FILTER_E_PASSWORD = 0x8004170B,
        /// 
        /// The document format is not recognised by the filter
        /// 
        FILTER_E_UNKNOWNFORMAT = 0x8004170C,
        /// 
        /// No text in current chunk
        /// 
        FILTER_E_NO_TEXT = 0x80041705,
        /// 
        /// No more chunks of text available in object
        /// 
        FILTER_E_END_OF_CHUNKS = 0x80041700,
        /// 
        /// No more text available in chunk
        /// 
        FILTER_E_NO_MORE_TEXT = 0x80041701,
        /// 
        /// No more property values available in chunk
        /// 
        FILTER_E_NO_MORE_VALUES = 0x80041702,
        /// 
        /// Unable to access object
        /// 
        FILTER_E_ACCESS = 0x80041703,
        /// 
        /// Moniker doesn't cover entire region
        /// 
        FILTER_W_MONIKER_CLIPPED = 0x00041704,
        /// 
        /// Unable to bind IFilter for embedded object
        /// 
        FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
        /// 
        /// Unable to bind IFilter for linked object
        /// 
        FILTER_E_LINK_UNAVAILABLE = 0x80041708,
        /// 
        /// This is the last text in the current chunk
        /// 
        FILTER_S_LAST_TEXT = 0x00041709,
        /// 
        /// This is the last value in the current chunk
        /// 
        FILTER_S_LAST_VALUES = 0x0004170A
    }

    /// 
    /// Convenience class which provides static methods to extract text from files using installed IFilters
    /// 
    public class DefaultParser
    {
        public DefaultParser()
        {
        }

        [DllImport("query.dll", CharSet = CharSet.Unicode)]
        private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

        private static IFilter loadIFilter(string filename)
        {
            object outer = null;
            IFilter filter = null;

            // Try to load the corresponding IFilter
            int resultLoad = LoadIFilter(filename,  outer, ref filter);
            if (resultLoad != (int) IFilterReturnCodes.S_OK)
            {
                return null;
            }
            return filter;
        }

        public static bool IsParseable(string filename)
        {
            return loadIFilter(filename) != null;
        }

        public static string Extract(string path)
        {
            StringBuilder sb = new StringBuilder();
            IFilter filter = null;

            try
            {
                filter = loadIFilter(path);

                if (filter == null)
                    return String.Empty;

                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();

                IFILTER_INIT iflags =
                    IFILTER_INIT.CANON_HYPHENS |
                    IFILTER_INIT.CANON_PARAGRAPHS |
                    IFILTER_INIT.CANON_SPACES |
                    IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                    IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                    IFILTER_INIT.HARD_LINE_BREAKS |
                    IFILTER_INIT.SEARCH_LINKS |
                    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                    throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                {
                    if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                    {
                        IFilterReturnCodes scode = 0;
                        while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                        {
                            uint pcwcBuffer = 65536;
                            System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)pcwcBuffer);

                            scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                            if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                            {
                                if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens !
                                    pcwcBuffer = (uint)sbBuffer.Length;

                                sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                sb.Append(" "); // "\r\n"
                            }

                        }
                    }

                }
            }
            finally
            {
                if (filter != null) {
                    Marshal.ReleaseComObject (filter);
                    System.GC.Collect();
                    System.GC.WaitForPendingFinalizers();
                }
            }

            return sb.ToString();
        }
    }
}

目前,这似乎是提取使用Windows服务器上的.NET平台文档中的文本的最佳方式。谢谢大家的帮助。

At the moment, this seems like the best way to extract text from documents using the .NET platform on a Windows server. Thanks everybody for your help.

更新 - 2011年3月8日

虽然我仍然认为的IFilter是一个很好的路要走,我认为,如果你正在寻找使用Lucene从.NET索引文档,一个非常好的替代方法是使用Solr 。当我第一次开始研究这个话题,我从来没有听说过的Solr。因此,对于那些你也没有吃谁,Solr的是一个独立的搜索服务,用Java编写的Lucene之上。这个想法是,你可以火起来的Solr了防火墙的机器上,并用它通过HTTP从你的.NET应用程序通信。 Solr的是真正写的一个服务,可以做一切的Lucene能做的,(包括使用提卡提取.PDF,.XLS,.DOC,.PPT等文字),然后一些。 Solr的似乎有一个非常活跃的社区,以及,这​​是一件事,我不确信的问候Lucene.NET。

While I still think that ifilters are a good way to go, I think if you are looking to index documents using Lucene from .NET, a very good alternative would be to use Solr. When I first started researching this topic, I had never heard of Solr. So, for those of you who have not either, Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine, and communicate with it via HTTP from your .NET application. Solr is truly written like a service and can do everything Lucene can do, (including using Tika extract text from .PDF, .XLS, .DOC, .PPT, etc), and then some. Solr seems to have a very active community as well, which is one thing I am not to sure of with regards to Lucene.NET.

推荐答案

您还可以检查出的IFilter - 有许多资源,如果你对asp.net的IFilter做一个搜索:

You can also check out ifilters - there are a number of resources if you do a search for asp.net ifilters:

  • http://www.codeproject.com/KB/cs/IFilter.aspx
  • http://en.wikipedia.org/wiki/IFilters
  • http://www.ifilter.org/
  • http://stackoverflow.com/questions/1535992/ifilter-or-sdk-for-many-file-types

当然,还有额外的麻烦,如果你正在分发该客户端系统,因为你要么需要包括你的发行版的IFilter和他们的计算机上安装那些与你的应用程序,否则将缺乏提取文本的能力从任何文件,他们没有对的IFilter

Of course, there is added hassle if you are distributing this to client systems, because you will either need to include the ifilters with your distribution and install those with your app on their machine, or they will lack the ability to extract text from any files they don't have ifilters for.

这篇关于索引.PDF,.XLS,.DOC,.PPT使用Lucene.NET的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆