Instant searching in petabytes of data
Question
I need to search over a petabyte of data in CSV-format files. After indexing with Lucene, the index is about twice the size of the original files. Is it possible to reduce the index size? How should Lucene index files be distributed in Hadoop, and how are they then used at search time? Or is that even necessary: should I use Solr to distribute the Lucene index? My requirement is instant search over petabytes of files.
Any decent off-the-shelf search engine (such as Lucene) should be able to provide search over the volume of data you have. You may have to do some up-front work to design the indexes and configure how the search works, but that is just configuration.
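As a rough illustration of that up-front index design, here is a minimal inverted index over a few CSV rows. The sample data and field names are made up for the example; a real engine like Lucene does the same mapping from values to document ids, only at vastly larger scale and with compression:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for the CSV files in the question.
CSV_DATA = """id,city,product
1,Berlin,laptop
2,Paris,phone
3,Berlin,phone
"""

def build_index(text):
    """Map each field value (lowercased) to the set of row ids containing it."""
    index = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        for field, value in row.items():
            if field != "id":
                index[value.lower()].add(row["id"])
    return index

index = build_index(CSV_DATA)
print(sorted(index["berlin"]))   # rows mentioning Berlin -> ['1', '3']
print(sorted(index["phone"]))    # rows mentioning phone  -> ['2', '3']
```

A lookup is then a set operation against the postings rather than a scan of the raw files, which is the whole point of paying the indexing cost up front.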
You won't get instant results, but you may get very quick ones. The speed will depend on how you set things up and what kind of hardware you run on.

You mention that the indexes are larger than the original data. That is to be expected: indexing usually involves some form of denormalisation. Index size is often a trade-off with speed; the more ways you slice and dice the data in advance, the quicker it is to find references.

Lastly, you mention distributing the indexes. That is almost certainly not something you want to do: the practicalities of moving many petabytes of data around are daunting. What you probably want is to keep the indexes on one big, well-provisioned machine and provide search services over the data (bring the query to the data; don't take the data to the query).
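The denormalisation point can be seen even at toy scale: an index that records every (row, position) pair for every term quickly outgrows the raw data it was built from. A minimal sketch, with made-up sample data:

```python
import csv
import io
import json
from collections import defaultdict

# Tiny stand-in corpus; contents are illustrative, not from the question.
CSV_DATA = "id,text\n1,big data search\n2,fast data index\n"

# A positional index stores, per term, every (row id, position) it occurs at --
# one form of the denormalisation that makes indexes bigger than the source.
postings = defaultdict(list)
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    for pos, term in enumerate(row["text"].split()):
        postings[term].append([row["id"], pos])

index_bytes = len(json.dumps(postings))
data_bytes = len(CSV_DATA)
print(index_bytes > data_bytes)  # the index already exceeds the raw data
```

Real engines compress their postings heavily, but the underlying trade-off is the same: every extra way of slicing the data adds index bytes in exchange for faster lookups.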