How to reduce the size of a generated Lucene/Solr index?

Problem description

I am working on a prototype of a search system.

I have a table in Oracle with several fields, filled with generated data that looks realistic: around 300,000 rows. For example:

PaymentNo|Datetime        |AmountEuro|PayersName            |PayersPhoneNo|ReceiversLegal|ReceiversAcc
2314     |2015-07-21T15:14|15.63     |Clinton, Barack Anjela|1.918.0060657|Nasa          |5555569778664190000
230338   |2015-08-01T15:14|34.87     |Merkel, George Donald |1.653.0060658|PepsiCo       |7777828443194736000

(Actually, there are more columns.)

The size of the table in Oracle is 62 MB (as reported by Toad).

I imported the table into Solr 5.2.1 (on Windows). The size of the index with data is 88 MB (on disk). The size of the index without data is 67 MB.

My question is: can I decrease the size of the index?

These options have already been tested: reducing the number of indexed table columns, switching off data storage in Solr, and excluding some of the rows from the index.

I need another way to reduce the size of the index. Do you know of any?

Solution

You can use all the insights provided here. I also want to share some additional points.

Solr duplicates data across several structures in order to provide fast search over the indexed data. One important point about Solr is that it uses immutable data structures to store all of this data:

  • Term dictionary: a dictionary of the indexed terms, along with their frequencies and offsets into the postings lists.
  • Term vectors: when enabled for a field, Solr stores a term vector for each indexed document. This is essentially a separate inverted index per document, and it is usually storage heavy.
  • Stored docs: stores each document with its fields in sequential order.
  • Doc values: stores the values of a field for all documents together. This is similar to columnar storage of data.
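To make the duplication concrete, here is a toy Python model of the four structures listed above. It is only a sketch for intuition and not how Lucene lays data out on disk. The sample rows come from the question's table (the third row is invented), and the `tokenize` helper is a hypothetical stand-in for a Solr analyzer.

```python
from collections import defaultdict

# Toy documents; field names are borrowed from the question's table.
docs = [
    {"PayersName": "Clinton, Barack Anjela", "AmountEuro": 15.63},
    {"PayersName": "Merkel, George Donald", "AmountEuro": 34.87},
    {"PayersName": "Clinton, George", "AmountEuro": 7.10},  # invented row
]


def tokenize(text):
    # Naive stand-in for an analyzer: lowercase, split on commas and spaces.
    return text.lower().replace(",", " ").split()


# Postings: term -> list of ids of the documents containing that term.
postings = defaultdict(list)
# Term vectors: doc id -> {term: frequency}, a small per-document index.
term_vectors = []

for doc_id, doc in enumerate(docs):
    counts = {}
    for term in tokenize(doc["PayersName"]):
        counts[term] = counts.get(term, 0) + 1
    term_vectors.append(counts)
    for term in counts:
        postings[term].append(doc_id)

# Term dictionary: sorted terms with their document frequency.
term_dictionary = {term: len(ids) for term, ids in sorted(postings.items())}

# Stored docs: the original field values, kept document by document.
stored_docs = [dict(doc) for doc in docs]

# Doc values: one column per field, addressed by doc id.
doc_values = {"AmountEuro": [doc["AmountEuro"] for doc in docs]}
```

The same payer name ends up in the postings, in the term vectors, and in the stored documents, which is why turning off the structures you do not need shrinks the index.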

You can disable document-level term vector storage if you are not using Solr's highlighting feature.
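As a rough sketch of what disabling term vectors could look like, the Python snippet below builds a `replace-field` request body for Solr's Schema API with `termVectors` and `stored` switched off. It assumes the collection uses a managed schema. The field type `text_general` and the helper `field_definition` are illustrative assumptions, and the field name `PayersName` is taken from the question's table. The snippet only prints the JSON and sends nothing.

```python
import json


def field_definition(name, field_type, stored=False, term_vectors=False,
                     doc_values=False):
    """Build a Solr field definition with storage-heavy options off by default."""
    return {
        "name": name,
        "type": field_type,  # assumed field type; adjust to your schema
        "indexed": True,
        "stored": stored,
        "termVectors": term_vectors,
        "docValues": doc_values,
    }


# "replace-field" overwrites the definition of an existing field.
payload = {"replace-field": field_definition("PayersName", "text_general")}
body = json.dumps(payload, indent=2)
print(body)
```

The resulting JSON would be POSTed to the collection's `/schema` endpoint. With a classic `schema.xml`, the same attributes are set directly on the `<field>` element. Either way, the documents have to be re-indexed before the index size reflects the change.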

Additionally, Solr uses different compression techniques for different types of data. It uses bit packing and vint (variable-length integer) compression for postings lists and numeric values, and LZ4 compression for stored fields and term vectors. It uses an FST data structure to store the term dictionary. An FST (finite state transducer) is a specialized, trie-like data structure.
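To illustrate why vint encoding helps with postings lists, the sketch below stores a sorted list of document ids as gaps between neighbours and writes each gap as a variable-length integer (7 data bits per byte, with the high bit marking that more bytes follow). It is a simplified illustration and not Lucene's actual codec. The delta step and all function names are assumptions chosen for the example.

```python
def encode_vint(value):
    """Encode a non-negative int, 7 bits per byte, low-order bits first."""
    if value < 0:
        raise ValueError("vint encoding expects a non-negative integer")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vints(data):
    """Decode a byte string holding consecutive vints into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Delta-encode sorted doc ids, then vint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the doc ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in decode_vints(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


doc_ids = [2314, 2320, 2400, 230338]
encoded = encode_postings(doc_ids)
print(len(encoded), "bytes instead of", 4 * len(doc_ids))
```

Small gaps fit into a single byte, so dense postings lists take far less space than fixed-width integers would.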
