Lucene vs Solr, indexing speed for sample data


Problem Description



I have worked with Lucene before and am now moving towards Solr. The problem is that I am not able to index with Solr as fast as I can with Lucene.

My Lucene Code:

public class LuceneIndexer {

public static void main(String[] args) {

    String indexDir = "/home/demo/indexes/index1/"; 
    IndexWriterConfig indexWriterConfig = null;

    long starttime = System.currentTimeMillis();

    try (Directory dir = FSDirectory.open(Paths.get(indexDir));
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriter indexWriter = new IndexWriter(dir,
                    (indexWriterConfig = new IndexWriterConfig(analyzer)));) {
        indexWriterConfig.setOpenMode(OpenMode.CREATE);

            StringField bat = new StringField("bat", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id = new StringField("id", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField name = new StringField("name", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id1 = new StringField("id1", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField name1 = new StringField("name1", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id2 = new StringField("id2", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$

            Document doc = new Document();
            doc.add(bat);doc.add(id);doc.add(name);doc.add(id1);doc.add(name1);doc.add(id2);

        for (int i = 0; i < 1000000; ++i) { 
             bat.setStringValue("book"+i);
             id.setStringValue("book id -" + i);
             name.setStringValue("The Legend of the Hobbit part 1 " + i);
             id1.setStringValue("book id -" + i);
             name1.setStringValue("The Legend of the Hobbit part 2 " + i); 
             id2.setStringValue("book id -" + i);//doc.addField("id2", "book id -" + i); //$NON-NLS-1$ 

             indexWriter.addDocument(doc);
        }
    }catch(Exception e) {
        e.printStackTrace();
    }
    long endtime = System.currentTimeMillis();
    System.out.println("committed"); //$NON-NLS-1$
    System.out.println("process completed in "+(endtime-starttime)/1000+" seconds"); //$NON-NLS-1$ //$NON-NLS-2$
}
}

Output: Process completed in 19 seconds

Followed by my Solr code:

    SolrClient solrClient = new HttpSolrClient("http://localhost:8983/solr/gettingstarted"); //$NON-NLS-1$

    // Empty the database...
    solrClient.deleteByQuery( "*:*" );// delete everything! //$NON-NLS-1$
    System.out.println("cleared"); //$NON-NLS-1$
    ArrayList<SolrInputDocument> docs = new ArrayList<>();



    long starttime = System.currentTimeMillis();
    for (int i = 0; i < 1000000; ++i) { 
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("bat", "biok"+i); //$NON-NLS-1$ //$NON-NLS-2$
        doc.addField("id", "biok id -" + i); //$NON-NLS-1$ //$NON-NLS-2$
        doc.addField("name", "Tle Legend of the Hobbit part 1 " + i); //$NON-NLS-1$ //$NON-NLS-2$
        doc.addField("id1", "bopk id -" + i); //$NON-NLS-1$ //$NON-NLS-2$
        doc.addField("name1", "Tue Legend of the Hobbit part 2 " + i); //$NON-NLS-1$ //$NON-NLS-2$
        doc.addField("id2", "bopk id -" + i); //$NON-NLS-1$ //$NON-NLS-2$

        docs.add(doc);

        if (i % 250000 == 0) {
            solrClient.add(docs);
            docs.clear();
        }
    }
    solrClient.add(docs);
    System.out.println("completed adding to Solr. Now committing.. Please wait"); //$NON-NLS-1$
    solrClient.commit();
    long endtime = System.currentTimeMillis();
    System.out.println("process completed in "+(endtime-starttime)/1000+" seconds"); //$NON-NLS-1$ //$NON-NLS-2$

Output: process completed in 159 seconds

My pom.xml is

<!-- solr dependency -->
    <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr-solrj</artifactId>
        <version>5.0.0</version>
    </dependency>

<!-- other dependency -->   
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1.1</version>
    </dependency>

<!-- Lucene dependency -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>5.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>5.0.0</version>
    </dependency>

I have downloaded Solr 5.0 and started it with $solr/bin/solr start -e cloud -noprompt, which starts Solr on 2 nodes.

I haven't changed anything in the Solr setup that I downloaded; can anyone guide me as to what is going wrong? I read that Solr can be used for near real time indexing (http://lucene.apache.org/solr/features.html), and I am not able to do that in my demo code, though Lucene is fast at indexing and can be used to do so in near real time, if not real time.

I know Solr uses Lucene, so what is the mistake that I am making? I am still researching the scenario.

Any help or guidance is most welcomed.

Thanks in advance! Cheers :)

Solution

Solr is a general-purpose highly-configurable search server. The Lucene code in Solr is tuned for general use, not specific use cases. Some tuning is possible in the configuration and the request syntax.

Well-tuned Lucene code written for a specific use-case will always outperform Solr. The disadvantage is that you must write, test, and debug the low-level implementation of the search code yourself. If that's not a major disadvantage to you, then you might want to stick to Lucene. You'll have more capability than Solr can give you, and you can very likely make it run faster.

The response you got from Erick on the Solr mailing list is relevant. To get the best indexing performance, your client must send updates to Solr in parallel.

The ConcurrentUpdateSolrClient that he mentioned is one way to do this, but it comes with a fairly major disadvantage -- the client code will not be informed if any of those indexing requests fails. CUSC swallows most exceptions.
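For reference, a minimal sketch of the ConcurrentUpdateSolrClient approach, assuming SolrJ 5.x (where the constructor takes a base URL, a queue size, and a thread count) and the gettingstarted collection from the question; the queue size, thread count, and batch size here are illustrative, not tuned values:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
    public static void main(String[] args) throws Exception {
        // Queue up to 10,000 docs and drain the queue with 4 background threads.
        // NOTE: CUSC swallows most request exceptions -- failures are only logged,
        // not thrown back to this code.
        try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                "http://localhost:8983/solr/gettingstarted", 10_000, 4)) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1_000_000; ++i) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "book id -" + i);
                doc.addField("name", "The Legend of the Hobbit part 1 " + i);
                batch.add(doc);
                if (batch.size() == 10_000) { // smaller, more frequent batches
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.blockUntilFinished(); // wait for the background queue to drain
            client.commit();
        }
    }
}
```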

If you want proper exception handling, you will need to manage the threads yourself and use HttpSolrClient, or CloudSolrClient if you choose to run SolrCloud. The SolrClient implementations are thread-safe.
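A sketch of that thread-managed alternative, again assuming SolrJ 5.x and the question's gettingstarted collection; the thread count, batch size, and field values are illustrative. Because HttpSolrClient.add throws on failure, calling Future.get surfaces any indexing exception to the caller:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        final int totalDocs = 1_000_000;
        final int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // HttpSolrClient is thread-safe, so one instance can be shared.
        try (SolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/gettingstarted")) {
            List<Future<?>> futures = new ArrayList<>();
            int chunk = totalDocs / threads;
            for (int t = 0; t < threads; ++t) {
                final int start = t * chunk;
                final int end = (t == threads - 1) ? totalDocs : start + chunk;
                // Each worker indexes its own contiguous id range in batches.
                futures.add(pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (int i = start; i < end; ++i) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", "book id -" + i);
                        doc.addField("name", "The Legend of the Hobbit part 1 " + i);
                        batch.add(doc);
                        if (batch.size() == 10_000) {
                            client.add(batch); // throws on failure, unlike CUSC
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                    return null;
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // rethrows any exception from the worker
            }
            client.commit();
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```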

