Tuning Neo4j for Performance


Question

I have imported data using Michael Hunger's Batch Import, through which I created:-

4,612,893 nodes
14,495,063 properties
    node properties are indexed.
5,300,237 relationships

{Question} Cypher queries are executing too slowly, almost crawling; a simple traversal takes > 5 mins to return a result set. Please let me know how to tune the server to get better performance, and what I am doing wrong.

Store details:-

-rw-r--r-- 1 root root 567M Jul 12 12:42 data/graph.db/neostore.propertystore.db
-rw-r--r-- 1 root root 167M Jul 12 12:42 data/graph.db/neostore.relationshipstore.db
-rw-r--r-- 1 root root  40M Jul 12 12:42 data/graph.db/neostore.nodestore.db
-rw-r--r-- 1 root root 7.8M Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings
-rw-r--r-- 1 root root  330 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys
-rw-r--r-- 1 root root  292 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names
-rw-r--r-- 1 root root  153 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays
-rw-r--r-- 1 root root   88 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index
-rw-r--r-- 1 root root   69 Jul 12 12:42 data/graph.db/neostore
-rw-r--r-- 1 root root   58 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.nodestore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshipstore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names.id

I am using

neo4j-community-1.9.1
java version "1.7.0_25"
Amazon EC2 m1.large instance with Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-40-virtual x86_64)
RAM ~8GB.
EBS 200 GB, neo4j is running on EBS volume.

Invoked as ./neo4j-community-1.9.1/bin/neo4j start

Below is the neo4j server info:

neostore.nodestore.db.mapped_memory                 161M
neostore.relationshipstore.db.mapped_memory         714M
neostore.propertystore.db.mapped_memory              90M
neostore.propertystore.db.index.keys.mapped_memory    1M
neostore.propertystore.db.strings.mapped_memory     130M
neostore.propertystore.db.arrays.mapped_memory      130M
mapped_memory_page_size                               1M
all_stores_total_mapped_memory_size                 500M


{Data Model} is like a social graph:-

User-User
    User-[:FOLLOWS]->User
User-Item
    User-[:CREATED]->Item
    User-[:LIKE]->Item
    User-[:COMMENT]->Item
    User-[:VIEW]->Item
Cluster-User
    User-[:FACEBOOK]->SocialLogin_Cluster
Cluster-Item
    Item-[:KIND_OF]->Type_Cluster
Cluster-Cluster
    Cluster-[:KIND_OF]->Type_Cluster

{Some queries} and their timings:

START u=node(467242)
MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b
WHERE NOT(a=b)
RETURN u,COUNT(b)

Query took 1015348 ms. Returned a result count of 70956115.

START a=node:nodes(kind="user")
RETURN a,length(a-[:CREATED|LIKE|COMMENT|FOLLOWS]-()) AS cnt
ORDER BY cnt DESC
LIMIT 10

Query took 231613 ms.

Following the suggestions, I upgraded the box to M1.xlarge and M2.2xlarge:

  • M1.xlarge (vCPU: 4, ECU: 8, RAM: 15 GB, instance storage: ~600 GB)
  • M2.2xlarge (vCPU: 4, ECU: 13, RAM: 34 GB, instance storage: ~800 GB)

I tuned the properties as below, and am now running from instance storage (as opposed to EBS).

neo4j.properties

neostore.nodestore.db.mapped_memory=1800M
neostore.relationshipstore.db.mapped_memory=1800M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=150M
neostore.propertystore.db.arrays.mapped_memory=10M

neo4j-wrapper.conf

wrapper.java.additional.1=-d64
wrapper.java.additional.2=-server
wrapper.java.additional.3=-XX:+UseConcMarkSweepGC
wrapper.java.additional.4=-XX:+CMSClassUnloadingEnabled
wrapper.java.initmemory=4098
wrapper.java.maxmemory=8192
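A quick sanity check on these numbers: in Neo4j 1.9 the mapped store files live off-heap, so they add on top of the JVM heap and both together must leave headroom for the OS. A rough budget for the settings above (store names shortened for readability):

```python
# Rough memory budget implied by the settings above (all figures in MB).
mapped_memory = {
    "nodestore": 1800,
    "relationshipstore": 1800,
    "propertystore": 100,
    "strings": 150,
    "arrays": 10,
}
heap_max = 8192  # wrapper.java.maxmemory

# Mapped memory is off-heap in 1.9, so it adds to the heap total.
total_mb = sum(mapped_memory.values()) + heap_max
print(total_mb)  # 12052 MB
```

About 12 GB in total, which fits the 15 GB m1.xlarge with a few GB of headroom and leaves plenty of spare RAM for the OS page cache on the 34 GB m2.2xlarge.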

But the queries (like the one below) still run in ~5-8 minutes, which is not acceptable from a recommendation point of view.

Query:

START u=node(467242)
MATCH u-[r1:LIKE]->a<-[r2:LIKE]-lu-[r3:LIKE]-b
RETURN u,COUNT(b)


{Profiling}

neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN u,COUNT(b);
==> +-------------------------+
==> | u            | COUNT(b) |
==> +-------------------------+
==> | Node[467242] | 70960482 |
==> +-------------------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["u", "  INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595"], returnItemNames=["u", "COUNT(b)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=["u"], aggregates=["(  INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595,Count)"], _rows=1, _db_hits=0)
==>   TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==>     ParameterPipe(_rows=1, _db_hits=0)


neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN count(distinct a),COUNT(distinct b),COUNT(*);
==> +--------------------------------------------------+
==> | count(distinct a) | COUNT(distinct b) | COUNT(*) |
==> +--------------------------------------------------+
==> | 1950              | 91294             | 70960482 |
==> +--------------------------------------------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c", "  INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33", "  INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005"], returnItemNames=["count(distinct a)", "COUNT(distinct b)", "COUNT(*)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c,Distinct)", "(  INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33,Distinct)", "(  INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005,CountStar)"], _rows=1, _db_hits=0)
==>   TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==>     ParameterPipe(_rows=1, _db_hits=0)

Please let me know the configuration and Neo4j startup arguments for tuning. Thanks in advance.

Answer

I ran this on my MacBook Air, with little RAM and CPU, against your dataset.

You will get much faster results than mine with more memory mapping, the GCR cache, and more heap for caches. Also make sure to use parameters in your queries.
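Parameters keep the query text constant, so the parsed query plan can be reused instead of being recompiled per call. A minimal sketch against the 1.9-era REST endpoint (`POST /db/data/cypher`); the endpoint path, port, and payload shape here are assumptions based on that API version, so adapt them to your client:

```python
import json

# Build a parameterized Cypher request body for Neo4j 1.9's REST API
# (assumed endpoint: POST http://localhost:7474/db/data/cypher).
# The node id travels in "params" instead of being concatenated into
# the query string, so the same query text is reused for every user.
payload = {
    "query": "START u=node({uid}) MATCH u-[:LIKE]->a RETURN count(a)",
    "params": {"uid": 467242},
}
body = json.dumps(payload)
```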

You are running into a combinatorial explosion.

Every step of the path adds "times rels" elements/rows to your matched subgraph.

See for instance here: you end up with 269268 matches but only 81674 distinct lu's.

The problem is that the next match is expanded for each row. So if you use distinct in between to limit the sizes again, it will be some orders of magnitude less data. Same for the next level.
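The blow-up is visible in plain arithmetic using the counts from the PROFILE output above:

```python
# Row counts taken from the profiled queries above.
two_hop_rows = 269_268       # u-->a<--lu matches
distinct_lu = 81_674
three_hop_rows = 70_960_482  # after adding the -[:LIKE]-b hop
distinct_b = 91_294

# Without `WITH distinct`, each hop expands from *rows*, not from
# distinct nodes, so duplicate paths multiply through every step.
rows_per_result = three_hop_rows // distinct_b
print(rows_per_result)  # ~777 rows materialized per distinct b
```

So roughly 777 rows are traversed and aggregated for every distinct result the query actually returns, which is exactly the work the intermediate `distinct` avoids.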

START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[r2:LIKED|COMMENTED]-lu
RETURN count(*),count(distinct a),count(distinct lu);

+---------------------------------------------------+
| count(*) | count(distinct a) | count(distinct lu) |
+---------------------------------------------------+
| 269268   | 1952              | 81674              |
+---------------------------------------------------+
1 row

895 ms

START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[:LIKED|COMMENTED]-lu
WITH distinct lu
MATCH lu-[:LIKED]-b
RETURN count(*),count(distinct lu), count(distinct b)
;
+---------------------------------------------------+
| count(*) | count(distinct lu) | count(distinct b) |
+---------------------------------------------------+
| 2311694  | 62705              | 91294             |
+---------------------------------------------------+

Here you have 2.3M total matches and only 91k distinct elements. So almost two orders of magnitude.

This is a huge aggregation, which is rather a BI/statistics query than an OLTP query. Usually you can store the result, e.g. on the user node, and only re-execute this in the background.

THESE kinds of queries are again global graph queries (statistics/BI), in this case the top 10 users.

Usually you would run these in the background (e.g. once per day or hour) and connect the top 10 user nodes to a special node or index, which can then be queried in a few ms.
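The Cypher batch job below does the heavy aggregation; the "pick the top 10" step itself is cheap. The same idea, sketched in Python with a hypothetical per-user count map standing in for the `count(*)` aggregate over CREATED/LIKED/COMMENTED/FOLLOWS relationships:

```python
import heapq

# Hypothetical per-user interaction counts, standing in for the
# aggregate the background job computes (user id -> rel count).
interaction_counts = {"user%d" % i: (i * 37) % 1000 for i in range(100_000)}

# Top 10 without sorting all 100k entries: O(n log 10) via a bounded heap
# instead of O(n log n) for a full sort.
top10 = heapq.nlargest(10, interaction_counts.items(), key=lambda kv: kv[1])
```

The resulting 10 ids are what you would wire up to the special `top-user` node, so the online query never touches the other 100k entries.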

START a=node:nodes(kind="user") RETURN count(*);
+----------+
| count(*) |
+----------+
| 3889031  |
+----------+
1 row

27329 ms

After all, you are running a match across the whole graph, i.e. 4M users. That is a graph-global query, not a graph-local one.

START n=node:nodes(kind="top-user")
MATCH n-[r?:TOP_USER]-()
DELETE r
WITH distinct n
START a=node:nodes(kind="user")
MATCH a-[:CREATED|LIKED|COMMENTED|FOLLOWS]-()
WITH n, a,count(*) as cnt
ORDER BY cnt DESC
LIMIT 10
CREATE a-[:TOP_USER {count:cnt} ]->n;

+-------------------+
| No data returned. |
+-------------------+
Relationships created: 10
Properties set: 10
Relationships deleted: 10

70316 ms

The query would then be:

START n=node:nodes(kind="top-user")
MATCH n-[r:TOP_USER]-a
RETURN a, r.count
ORDER BY r.count DESC;

+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a                                                                                                                                                  | r.count |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
….
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows

4 ms
