Solr 4.6.0 DataImportHandler speed up performance


Question


I am using Solr 4.6.0 and indexing about 10,000 elements at a time, but I suffer from bad import performance: importing those 10,000 documents takes about 10 minutes. Of course I know that this depends heavily on the server hardware, but I would still like to know what performance boosts are possible and which of them are actually useful in real-world situations (joins etc.). I would also be very thankful for precise examples rather than just links to the official documentation.

Here is the data-config.xml:

<dataConfig>
    <dataSource name="mysql" type="JdbcDataSource" 
        driver="com.mysql.jdbc.Driver" 
        url="jdbc:mysql://xxxx" 
        batchSize="-1" 
        user="xxxx" password="xxxx" />
    <document name="publications">
        <entity name="publication" transformer="RegexTransformer" pk="id" query="
            SELECT 
                sm_publications.id AS p_id, 
                CONCAT(sm_publications.title, ' ', sm_publications.abstract) AS p_text,
                sm_publications.year AS p_year,
                sm_publications.doi AS p_doi,
                sm_conferences.full_name AS c_fullname,
                sm_journals.full_name AS j_fullname,
                GROUP_CONCAT(DISTINCT sm_query_publications.query_id SEPARATOR '_-_-_-_-_') AS q_id
            FROM sm_publications 
            LEFT JOIN sm_conferences ON sm_conferences.id = sm_publications.conference_id 
            LEFT JOIN sm_journals ON sm_journals.id = sm_publications.journal_id 
            INNER JOIN sm_query_publications ON sm_query_publications.publication_id = sm_publications.id 
            WHERE '${dataimporter.request.clean}' != 'false' OR 
                sm_publications.modified > '${dataimporter.last_index_time}' GROUP BY sm_publications.id">
            <field column="p_id" name="id" />
            <field column="p_text" name="text" />
            <field column="p_text" name="text_tv" />
            <field column="p_year" name="year" />
            <field column="p_doi" name="doi" />
            <field column="c_fullname" name="conference" />
            <field column="j_fullname" name="journal" />
            <field column="q_id" name="queries" splitBy="_-_-_-_-_" />

            <entity name="publication_authors" query="
                SELECT 
                    CONCAT(
                        IF(sm_authors.first_name != '',sm_authors.first_name,''), 
                        IF(sm_authors.middle_name != '',CONCAT(' ',sm_authors.middle_name),''), 
                        IF(sm_authors.last_name != '',CONCAT(' ',sm_authors.last_name),'')
                    ) AS a_name, 
                    sm_affiliations.display_name AS aa_display_name, 
                    CONCAT(sm_affiliations.latitude, ',', sm_affiliations.longitude) AS aa_geo, 
                    sm_affiliations.country_name AS aa_country_name
                FROM sm_publication_authors 
                INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id 
                LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id 
                WHERE sm_publication_authors.publication_id = '${publication.p_id}'">
                    <field column="a_name" name="authors" />
                    <field column="aa_display_name" name="affiliations" />
                    <field column="aa_geo" name="geo" />
                    <field column="aa_country_name" name="countries" />
            </entity>

            <entity name="publication_keywords" query="
                SELECT sm_keywords.name FROM sm_publication_keywords 
                INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
                WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
                    <field column="name" name="keywords" />
            </entity>

        </entity>
    </document>
</dataConfig>

Answer


By query caching, I meant the CachedSqlEntityProcessor. I favor the merged solution, as in your other question MySQL GROUP_CONCAT duplicate entries. But CachedSqlEntityProcessor will help too, if p_id is repeated over and over in the result set of the main query for publication_authors, and you are less concerned about the extra memory usage.
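Applied to the config in the question, the merged approach would fold a sub-entity into the main query using GROUP_CONCAT and split the result with the RegexTransformer that is already in use for q_id. A hypothetical sketch (the a_names column and the concatenated name expression are assumptions, and the remaining columns are elided):

```xml
<entity name="publication" transformer="RegexTransformer" pk="id" query="
    SELECT
        sm_publications.id AS p_id,
        GROUP_CONCAT(DISTINCT sm_authors.last_name SEPARATOR '_-_-_-_-_') AS a_names
        /* ... remaining columns and joins as in the original query ... */
    FROM sm_publications
    LEFT JOIN sm_publication_authors
        ON sm_publication_authors.publication_id = sm_publications.id
    LEFT JOIN sm_authors
        ON sm_authors.id = sm_publication_authors.author_id
    GROUP BY sm_publications.id">
    <!-- splitBy turns the concatenated string back into multiple field values -->
    <field column="a_names" name="authors" splitBy="_-_-_-_-_" />
</entity>
```

This trades one query with N sub-queries for a single grouped query, at the cost of a wider result set and the GROUP_CONCAT length limit in MySQL.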


Update: It looks like you have the two other questions solved, so you can probably go either way. I am posting the short example/pointer you requested anyway, in case others find it handy:

<entity name="x" query="select * from x">
    <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" where="xid=x.id">
    </entity>
</entity>


This example was taken from the wiki. It will still resolve the lookup "select * from y where xid=id" for each id from the main query "select * from x", but it won't send the same query to the database repeatedly.
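The same idea could be applied to the sub-entities in the question's config. A sketch for the keywords entity, under the assumption that the WHERE clause moves out of the SQL and into the DIH where attribute, so that the whole join is fetched once and the per-publication lookup happens against the in-memory cache:

```xml
<entity name="publication_keywords" processor="CachedSqlEntityProcessor"
    query="SELECT sm_keywords.name, sm_publication_keywords.publication_id
           FROM sm_publication_keywords
           INNER JOIN sm_keywords
               ON sm_keywords.id = sm_publication_keywords.keyword_id"
    where="publication_id=publication.p_id">
    <field column="name" name="keywords" />
</entity>
```

Note that publication_id must now be selected explicitly so the cache has a key column to match against, and the entire keyword join is held in memory for the duration of the import.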
