How to use HIGH_COMPRESSION in Lucene.Net 4.8

Question

I'm trying to reduce the index size as much as possible. Any help, please? https://lucenenet.apache.org/docs/4.8.0-beta00013/api/core/Lucene.Net.Codecs.Compressing.CompressionMode.html#Lucene_Net_Codecs_Compressing_CompressionMode_HIGH_COMPRESSION

public class LuceneIndexer
    {
        private Analyzer _analyzer = new ArabicAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48);
        private string _indexPath;
        private Directory _indexDirectory;
        public IndexWriter _indexWriter;

        public LuceneIndexer(string indexPath)
        {
            this._indexPath = indexPath;
            _indexDirectory = new SimpleFSDirectory(new System.IO.DirectoryInfo(_indexPath));
        }

        public void BuildCompleteIndex(IEnumerable<Document> documents)
        {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Lucene.Net.Util.LuceneVersion.LUCENE_48, _analyzer) { OpenMode = OpenMode.CREATE_OR_APPEND };
            indexWriterConfig.MaxBufferedDocs = 2;
            indexWriterConfig.RAMBufferSizeMB = 128;
            indexWriterConfig.MaxThreadStates = 2;

            _indexWriter = new IndexWriter(_indexDirectory, indexWriterConfig);

            _indexWriter.AddDocuments(documents);

            _indexWriter.Flush(true, true);
            _indexWriter.Commit();

            _indexWriter.Dispose();
        }

        
        public IEnumerable<Document> Search(string searchTerm, string searchField, int limit)
        {
            IndexReader indexReader = DirectoryReader.Open(_indexDirectory);
            var searcher = new IndexSearcher(indexReader);
            var termQuery = new TermQuery(new Term(searchField, searchTerm)); // Lucene.Net.Util.LuceneVersion.LUCENE_48, searchField, _analyzer
            var hits = searcher.Search(termQuery, limit).ScoreDocs;

            var documents = new List<Document>();
            foreach (var hit in hits)
            {
                documents.Add(searcher.Doc(hit.Doc));
            }

            _analyzer.Dispose();
            return documents;
        }

    }

Recommended answer

The first thing to know is that there are many aspects to the "Lucene Index". When not using compound files, this shows up in the various files that are created. Looking at just two of those, we can talk about the inverted index, which is called postings, and we can talk about the stored documents. Of these two, the inverted index has no readily available tunable compression settings, as best I can tell.
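
To see which part of the index actually dominates its size, one option is to sum the file sizes in the index folder by extension. This is a minimal sketch that uses only System.IO; the class name IndexSizeReport is hypothetical. It assumes the index was written without compound files, since a .cfs file hides the individual parts. In the Lucene 4.x file formats, stored fields live in the .fdt and .fdx files, while postings are spread over files such as .doc, .pos, .tim and .tip.

```csharp
using System;
using System.IO;
using System.Linq;

public static class IndexSizeReport
{
    // Sums the file sizes in an index folder, grouped by file extension,
    // and prints the groups from largest to smallest.
    public static void Print(string indexPath)
    {
        var groups = new DirectoryInfo(indexPath)
            .EnumerateFiles()
            .GroupBy(f => f.Extension.ToLowerInvariant())
            .Select(g => new { Extension = g.Key, Bytes = g.Sum(f => f.Length) })
            .OrderByDescending(g => g.Bytes);

        foreach (var g in groups)
        {
            // Files without an extension (for example segments_N) show up with an empty name.
            Console.WriteLine($"{g.Extension,-8} {g.Bytes,12:N0} bytes");
        }
    }
}
```

If .fdt is only a small share of the total, a stronger stored-fields compression mode is unlikely to shrink the index by much.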

The HIGH_COMPRESSION mode relates to the stored fields. If you are not storing fields and you are only using Lucene.Net to create an inverted index then doing work to turn on high compression for stored fields won't reduce the size of the "Lucene Index".
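
As a small illustration of that distinction, the sketch below builds a document with one stored field and one field that is indexed but not stored. The class name and the field names "id" and "body" are made up for the example.

```csharp
using Lucene.Net.Documents;

public static class DocumentShapes
{
    public static Document Build(string id, string body)
    {
        var doc = new Document();

        // Stored: the original value is written to the stored-fields files,
        // so the stored-fields compression mode applies to it.
        doc.Add(new StringField("id", id, Field.Store.YES));

        // Indexed but not stored: the text only feeds the inverted index,
        // so HIGH_COMPRESSION has nothing to compress for this field.
        doc.Add(new TextField("body", body, Field.Store.NO));

        return doc;
    }
}
```

Switching large fields to Field.Store.NO, when the application never needs to read the original text back from the index, may save more space than any compression setting.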

That said, if you are storing fields and want to use high compression on that stored-fields data, you will need to create your own codec that has high compression turned on for stored fields. To do that, you first need a stored-fields format class that has high compression turned on. Below are those two classes, followed by a unit test that uses this new codec, which I have written for you. I haven't tried this code on a large amount of data to see the effect; I leave that to you as an exercise. But this should point the way to getting your stored fields compressed with HIGH_COMPRESSION.

/*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

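using Lucene.Net.Codecs.Compressing;

/// <summary>
/// Stored fields format that uses CompressionMode.HIGH_COMPRESSION.
/// </summary>
public sealed class Lucene41StoredFieldsHighCompressionFormat : CompressingStoredFieldsFormat {
    /// <summary>
    /// Sole constructor. </summary>
    public Lucene41StoredFieldsHighCompressionFormat()
        // Arguments: format name, compression mode, chunk size in bytes (1 << 14 = 16 KB).
        : base("Lucene41StoredFieldsHighCompression", CompressionMode.HIGH_COMPRESSION, 1 << 14) {
    }
}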

Here is a custom codec to use this High Compression format:

    // Namespaces needed outside the Lucene.Net source tree:
    // MethodImpl, the Codec base types, and the Lucene46 field/segment info formats.
    using System.Runtime.CompilerServices;
    using Lucene.Net.Codecs;
    using Lucene.Net.Codecs.Lucene46;
    using Lucene40LiveDocsFormat = Lucene.Net.Codecs.Lucene40.Lucene40LiveDocsFormat;
    using Lucene42NormsFormat = Lucene.Net.Codecs.Lucene42.Lucene42NormsFormat;
    using Lucene42TermVectorsFormat = Lucene.Net.Codecs.Lucene42.Lucene42TermVectorsFormat;
    using PerFieldDocValuesFormat = Lucene.Net.Codecs.PerField.PerFieldDocValuesFormat;
    using PerFieldPostingsFormat = Lucene.Net.Codecs.PerField.PerFieldPostingsFormat;

    /// <summary>
    /// Implements the Lucene 4.6 index format, with configurable per-field postings
    /// and docvalues formats.
    /// <para/>
    /// If you want to reuse functionality of this codec in another codec, extend
    /// <see cref="FilterCodec"/>.
    /// <para/>
    /// See <see cref="Lucene.Net.Codecs.Lucene46"/> package documentation for file format details.
    /// <para/>
    /// @lucene.experimental 
    /// </summary>
    // NOTE: if we make largish changes in a minor release, easier to just make Lucene46Codec or whatever
    // if they are backwards compatible or smallish we can probably do the backwards in the postingsreader
    // (it writes a minor version, etc).
    [CodecName("Lucene46HighCompression")]
    public class Lucene46HighCompressionCodec : Codec {
        private readonly StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsHighCompressionFormat();    //<--This is the only line that differs from the stock Lucene46Codec
        private readonly TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
        private readonly FieldInfosFormat fieldInfosFormat = new Lucene46FieldInfosFormat();
        private readonly SegmentInfoFormat segmentInfosFormat = new Lucene46SegmentInfoFormat();
        private readonly LiveDocsFormat liveDocsFormat = new Lucene40LiveDocsFormat();

        private readonly PostingsFormat postingsFormat;

        private class PerFieldPostingsFormatAnonymousInnerClassHelper : PerFieldPostingsFormat {
            private readonly Lucene46HighCompressionCodec outerInstance;

            public PerFieldPostingsFormatAnonymousInnerClassHelper(Lucene46HighCompressionCodec outerInstance) {
                this.outerInstance = outerInstance;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public override PostingsFormat GetPostingsFormatForField(string field) {
                return outerInstance.GetPostingsFormatForField(field);
            }
        }

        private readonly DocValuesFormat docValuesFormat;

        private class PerFieldDocValuesFormatAnonymousInnerClassHelper : PerFieldDocValuesFormat {
            private readonly Lucene46HighCompressionCodec outerInstance;

            public PerFieldDocValuesFormatAnonymousInnerClassHelper(Lucene46HighCompressionCodec outerInstance) {
                this.outerInstance = outerInstance;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public override DocValuesFormat GetDocValuesFormatForField(string field) {
                return outerInstance.GetDocValuesFormatForField(field);
            }
        }

        /// <summary>
        /// Sole constructor. </summary>
        public Lucene46HighCompressionCodec()
            : base() {
            postingsFormat = new PerFieldPostingsFormatAnonymousInnerClassHelper(this);
            docValuesFormat = new PerFieldDocValuesFormatAnonymousInnerClassHelper(this);
        }

        public override sealed StoredFieldsFormat StoredFieldsFormat => fieldsFormat;

        public override sealed TermVectorsFormat TermVectorsFormat => vectorsFormat;

        public override sealed PostingsFormat PostingsFormat => postingsFormat;

        public override sealed FieldInfosFormat FieldInfosFormat => fieldInfosFormat;

        public override sealed SegmentInfoFormat SegmentInfoFormat => segmentInfosFormat;

        public override sealed LiveDocsFormat LiveDocsFormat => liveDocsFormat;

        /// <summary>
        /// Returns the postings format that should be used for writing
        /// new segments of <paramref name="field"/>.
        /// <para/>
        /// The default implementation always returns "Lucene41"
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public virtual PostingsFormat GetPostingsFormatForField(string field) {
            // LUCENENET specific - lazy initialize the codec to ensure we get the correct type if overridden.
            if (defaultFormat == null) {
                defaultFormat = Lucene.Net.Codecs.PostingsFormat.ForName("Lucene41");
            }
            return defaultFormat;
        }

        /// <summary>
        /// Returns the docvalues format that should be used for writing
        /// new segments of <paramref name="field"/>.
        /// <para/>
        /// The default implementation always returns "Lucene45"
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public virtual DocValuesFormat GetDocValuesFormatForField(string field) {
            // LUCENENET specific - lazy initialize the codec to ensure we get the correct type if overridden.
            if (defaultDVFormat == null) {
                defaultDVFormat = Lucene.Net.Codecs.DocValuesFormat.ForName("Lucene45");
            }
            return defaultDVFormat;
        }

        public override sealed DocValuesFormat DocValuesFormat => docValuesFormat;

        // LUCENENET specific - lazy initialize the codecs to ensure we get the correct type if overridden.
        private PostingsFormat defaultFormat;
        private DocValuesFormat defaultDVFormat;

        private readonly NormsFormat normsFormat = new Lucene42NormsFormat();

        public override sealed NormsFormat NormsFormat => normsFormat;
    }

Thanks to @NightOwl888, I now understand that you will also need to register the new Codec at startup like so:

Codec.SetCodecFactory(new DefaultCodecFactory {
    CustomCodecTypes = new Type[] { typeof(Lucene46HighCompressionCodec) }
});
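
Applied to the indexer from the question, the wiring might look roughly like the sketch below. The class name HighCompressionIndexing and its method names are hypothetical; Lucene46HighCompressionCodec is the class defined above. One assumption worth stating: a codec only affects segments written after it is set, so segments already on disk keep their old format until a merge rewrites them. Rebuilding into an empty directory is the surest way to get everything written with the new codec.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Codecs;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class HighCompressionIndexing
{
    // Call once at application startup, before any index is opened.
    public static void RegisterCodec()
    {
        Codec.SetCodecFactory(new DefaultCodecFactory
        {
            CustomCodecTypes = new[] { typeof(Lucene46HighCompressionCodec) }
        });
    }

    public static void Build(Directory directory, Analyzer analyzer, IEnumerable<Document> documents)
    {
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        {
            OpenMode = OpenMode.CREATE_OR_APPEND,
            // New segments are written with the high-compression stored fields format.
            Codec = new Lucene46HighCompressionCodec()
        };

        using (var writer = new IndexWriter(directory, config))
        {
            writer.AddDocuments(documents);

            // Optional and potentially slow on large indexes: merging rewrites the
            // segments it touches using the writer's current codec.
            writer.ForceMerge(1);

            writer.Commit();
        }
    }
}
```

The using block disposes the writer even if indexing throws, which replaces the explicit Flush and Dispose calls in the question's code.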

Here is a unit test to demonstrate use of the High Compression Codec:

public class TestCompression {


        [Fact]
        public void HighCompression() {
            FxTest.Setup();     // Helper from the author's test project (not shown here).

            Directory indexDir = new RAMDirectory();

            Analyzer standardAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

            IndexWriterConfig indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, standardAnalyzer);
            indexConfig.Codec = new Lucene46HighCompressionCodec();     //<--------Install the High Compression codec.

            indexConfig.UseCompoundFile = true;

            IndexWriter writer = new IndexWriter(indexDir, indexConfig);

            // Source: https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00006/src/Lucene.Net/Search/SearcherFactory.cs
            SearcherManager searcherManager = new SearcherManager(writer, applyAllDeletes: true, new SearchWarmer());

            Document doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "001", Field.Store.YES));
            doc.Add(new TextField("exampleField", "Unique gifts are great gifts.", Field.Store.YES));
            writer.AddDocument(doc);

            doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "002", Field.Store.YES));
            doc.Add(new TextField("exampleField", "Everyone is gifted.", Field.Store.YES));
            writer.AddDocument(doc);

            doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "003", Field.Store.YES));
            doc.Add(new TextField("exampleField", "Gifts are meant to be shared.", Field.Store.YES));
            writer.AddDocument(doc);

            writer.Commit();

            searcherManager.MaybeRefreshBlocking();
            IndexSearcher indexSearcher = searcherManager.Acquire();
            try {
                QueryParser parser = new QueryParser(LuceneVersion.LUCENE_48, "exampleField", standardAnalyzer);
                Query query = parser.Parse("everyone");

                TopDocs topDocs = indexSearcher.Search(query, int.MaxValue);

                int numMatchingDocs = topDocs.ScoreDocs.Length;
                Assert.Equal(1, numMatchingDocs);


                Document docRead = indexSearcher.Doc(topDocs.ScoreDocs[0].Doc);
                string primaryKey = docRead.Get("examplePrimaryKey");
                Assert.Equal("002", primaryKey);

            } finally {
                searcherManager.Release(indexSearcher);
            }

        }

    }
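
The test above checks that the codec works, not how much space it saves. A rough way to measure the effect is to build the same documents twice, once with the stock Lucene46Codec and once with the custom codec, and compare the total file sizes. The sketch below is only an assumption about how such a comparison could be set up: the class name, field names, document count and sample text are invented, and synthetic repetitive text compresses far better than real data, so the numbers are only meaningful when the loop is fed real documents. The custom codec is assumed to have been registered at startup as shown earlier.

```csharp
using System;
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Codecs;
using Lucene.Net.Codecs.Lucene46;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class CompressionComparison
{
    // Builds a throwaway in-memory index with the given codec
    // and returns the combined size of all index files in bytes.
    private static long BuildAndMeasure(Codec codec, int docCount)
    {
        using (var dir = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer) { Codec = codec };
            using (var writer = new IndexWriter(dir, config))
            {
                for (int i = 0; i < docCount; i++)
                {
                    var doc = new Document();
                    doc.Add(new StringField("id", i.ToString(), Field.Store.YES));
                    doc.Add(new TextField("body", "Gifts are meant to be shared, document number " + i, Field.Store.YES));
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            // Sum the length of every file the writer left in the directory.
            return dir.ListAll().Sum(name => dir.FileLength(name));
        }
    }

    public static void Run()
    {
        long stock = BuildAndMeasure(new Lucene46Codec(), 10000);
        long high = BuildAndMeasure(new Lucene46HighCompressionCodec(), 10000);
        Console.WriteLine($"stock: {stock:N0} bytes, high compression: {high:N0} bytes");
    }
}
```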

While my initial response was via a Lucene.Net GitHub issue, I'm providing the answer here, where it will have better visibility to the Lucene.Net community, in hopes that it helps others as well. For those interested, there is more background information about using a custom codec towards the end of that issue's thread.
