Tuning Neo4j for Performance


Problem description

I imported the data using Michael Hunger's Batch Import, through which I created:

4,612,893 nodes
14,495,063 properties
    node properties are indexed.
5,300,237 relationships

{Question} Cypher queries are executing too slowly, almost crawling; a simple traversal takes > 5 minutes to return a result set. Please let me know how to tune the server for better performance, and what I am doing wrong.

Store details:

-rw-r--r-- 1 root root 567M Jul 12 12:42 data/graph.db/neostore.propertystore.db
-rw-r--r-- 1 root root 167M Jul 12 12:42 data/graph.db/neostore.relationshipstore.db
-rw-r--r-- 1 root root  40M Jul 12 12:42 data/graph.db/neostore.nodestore.db
-rw-r--r-- 1 root root 7.8M Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings
-rw-r--r-- 1 root root  330 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys
-rw-r--r-- 1 root root  292 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names
-rw-r--r-- 1 root root  153 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays
-rw-r--r-- 1 root root   88 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index
-rw-r--r-- 1 root root   69 Jul 12 12:42 data/graph.db/neostore
-rw-r--r-- 1 root root   58 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.nodestore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshipstore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.id
-rw-r--r-- 1 root root    9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names.id

I am using:

neo4j-community-1.9.1
java version "1.7.0_25"
Amazon EC2 m1.large instance with Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-40-virtual x86_64)
RAM ~8GB.
EBS 200 GB, neo4j is running on EBS volume.

Invoked as ./neo4j-community-1.9.1/bin/neo4j start

Below is the neo4j server info:

neostore.nodestore.db.mapped_memory                 161M
neostore.relationshipstore.db.mapped_memory         714M
neostore.propertystore.db.mapped_memory              90M
neostore.propertystore.db.index.keys.mapped_memory    1M
neostore.propertystore.db.strings.mapped_memory     130M
neostore.propertystore.db.arrays.mapped_memory      130M
mapped_memory_page_size                               1M
all_stores_total_mapped_memory_size                 500M

---

{Data model} It's like a social graph:

User-User
    User-[:FOLLOWS]->User
User-Item
    User-[:CREATED]->Item
    User-[:LIKE]->Item
    User-[:COMMENT]->Item
    User-[:VIEW]->Item
Cluster-User
    User-[:FACEBOOK]->SocialLogin_Cluster
Cluster-Item
    Item-[:KIND_OF]->Type_Cluster
Cluster-Cluster
    Cluster-[:KIND_OF]->Type_Cluster

{Some queries} and timings:

START u=node(467242)
MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b
WHERE NOT(a=b)
RETURN u,COUNT(b)

The query took 1015348 ms and returned a result count of 70956115.

START a=node:nodes(kind="user")
RETURN a,length(a-[:CREATED|LIKE|COMMENT|FOLLOWS]-()) AS cnt
ORDER BY cnt DESC
LIMIT 10

The query took 231613 ms.

Following the suggestions, I upgraded the box to M1.xlarge and M2.2xlarge:

  • M1.xlarge (vCPU: 4, ECU: 8, RAM: 15 GB, instance storage: ~600 GB)
  • M2.2xlarge (vCPU: 4, ECU: 13, RAM: 34 GB, instance storage: ~800 GB)

I tuned the properties as below, and ran from instance storage (as opposed to EBS):

neo4j.properties

neostore.nodestore.db.mapped_memory=1800M
neostore.relationshipstore.db.mapped_memory=1800M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=150M
neostore.propertystore.db.arrays.mapped_memory=10M

neo4j-wrapper.conf

wrapper.java.additional.1=-d64
wrapper.java.additional.2=-server
wrapper.java.additional.3=-XX:+UseConcMarkSweepGC
wrapper.java.additional.4=-XX:+CMSClassUnloadingEnabled
wrapper.java.initmemory=4098
wrapper.java.maxmemory=8192

But the queries (like the one below) still take ~5-8 minutes to run, which is not acceptable from a recommendation point of view.

Query:

START u=node(467242)
MATCH u-[r1:LIKE]->a<-[r2:LIKE]-lu-[r3:LIKE]-b
RETURN u,COUNT(b)

---

{Profiling}

neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN u,COUNT(b);
==> +-------------------------+
==> | u            | COUNT(b) |
==> +-------------------------+
==> | Node[467242] | 70960482 |
==> +-------------------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["u", "  INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595"], returnItemNames=["u", "COUNT(b)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=["u"], aggregates=["(  INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595,Count)"], _rows=1, _db_hits=0)
==>   TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==>     ParameterPipe(_rows=1, _db_hits=0)


neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN count(distinct a),COUNT(distinct b),COUNT(*);
==> +--------------------------------------------------+
==> | count(distinct a) | COUNT(distinct b) | COUNT(*) |
==> +--------------------------------------------------+
==> | 1950              | 91294             | 70960482 |
==> +--------------------------------------------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c", "  INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33", "  INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005"], returnItemNames=["count(distinct a)", "COUNT(distinct b)", "COUNT(*)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c,Distinct)", "(  INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33,Distinct)", "(  INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005,CountStar)"], _rows=1, _db_hits=0)
==>   TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==>     ParameterPipe(_rows=1, _db_hits=0)

Please let me know the configuration and neo4j startup arguments for tuning. Thanks in advance.

Answer

I ran this on my MacBook Air, with little RAM and CPU, against your dataset.

You will get much faster results than mine with more memory mapping, the GCR cache, and more heap for caches. Also make sure to use parameters in your queries.
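As a sketch of the parameter advice (the `{userId}` placeholder name is my choice, not from the original post), the first query could be written with a parameter instead of a literal node id, so the server can reuse the cached execution plan across users:

```cypher
// Hypothetical parameterized form of the original query;
// {userId} is supplied with the request instead of being
// baked into the query text.
START u=node({userId})
MATCH u-[:LIKE|COMMENT]->a<-[:LIKE|COMMENT]-lu-[:LIKE]-b
RETURN u, COUNT(b)
```

In neo4j-shell, a parameter can (if I recall the 1.9 shell correctly) be set before running the query, e.g. with `export userId=467242`.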

You are running into a combinatorial explosion.

Every step of the path multiplies the matched rows by the number of relationships at that step, i.e. each step adds "times rels" elements/rows to your matched subgraph.

See for instance here: you end up with 269268 matches, but you only have 81674 distinct lu's.

The problem is that the next match is expanded for each of those rows. So if you use distinct in between to limit the sizes again, it will be some orders of magnitude less data. The same holds for the next level.

START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[r2:LIKED|COMMENTED]-lu
RETURN count(*),count(distinct a),count(distinct lu);

+---------------------------------------------------+
| count(*) | count(distinct a) | count(distinct lu) |
+---------------------------------------------------+
| 269268   | 1952              | 81674              |
+---------------------------------------------------+
1 row

895 ms

START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[:LIKED|COMMENTED]-lu
WITH distinct lu
MATCH lu-[:LIKED]-b
RETURN count(*),count(distinct lu), count(distinct b)
;
+---------------------------------------------------+
| count(*) | count(distinct lu) | count(distinct b) |
+---------------------------------------------------+
| 2311694  | 62705              | 91294             |
+---------------------------------------------------+

Here you have 2.3M total matches but only 91k distinct elements, so almost 2 orders of magnitude fewer.

This is a huge aggregation; it is a BI/statistics query rather than an OLTP query. Usually you can store the results, e.g. on the user node, and only re-execute this in the background.

These kinds of queries are again global graph queries (statistics/BI), in this case the top-10 users.

Usually you would run these in the background (e.g. once per day or hour) and connect the top-10 user nodes to a special node or index, which can then be queried in a few ms.

START a=node:nodes(kind="user") RETURN count(*);
+----------+
| count(*) |
+----------+
| 3889031  |
+----------+
1 row

27329 ms

After all, you are running a match across the whole graph, i.e. 4M users; that's a graph-global query, not a graph-local one.

START n=node:nodes(kind="top-user")
MATCH n-[r?:TOP_USER]-()
DELETE r
WITH distinct n
START a=node:nodes(kind="user")
MATCH a-[:CREATED|LIKED|COMMENTED|FOLLOWS]-()
WITH n, a,count(*) as cnt
ORDER BY cnt DESC
LIMIT 10
CREATE a-[:TOP_USER {count:cnt} ]->n;

+-------------------+
| No data returned. |
+-------------------+
Relationships created: 10
Properties set: 10
Relationships deleted: 10

70316 ms

The query then becomes:

START n=node:nodes(kind="top-user")
MATCH n-[r:TOP_USER]-a
RETURN a, r.count
ORDER BY r.count DESC;

+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a                                                                                                                                                  | r.count |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
….
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows

4 ms
