What is the fastest ArangoDB friends-of-friends query (with count)


Problem description

I'm trying to use ArangoDB to get a list of friends-of-friends. I don't just want a basic friends-of-friends list: I also want to know how many friends the user and each friend-of-a-friend have in common, and to sort the result by that count. After several attempts at (re)writing the best-performing AQL query, this is what I ended up with:

LET friends = (
  FOR f IN GRAPH_NEIGHBORS('graph', @user, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
  RETURN f._id
)

LET foafs = (FOR friend IN friends
  FOR foaf in GRAPH_NEIGHBORS('graph', friend, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
    FILTER foaf._id != @user AND foaf._id NOT IN friends
    COLLECT foaf_result = foaf WITH COUNT INTO common_friend_count
    RETURN {
      user: foaf_result,
      common_friend_count: common_friend_count
    }
)
FOR foaf IN foafs
  SORT foaf.common_friend_count DESC
  RETURN foaf

Unfortunately, performance is not as good as I would have liked. Compared to the Neo4j version of the same query (on the same data), AQL seems quite a bit slower (5-10x).

What I'd like to know is: how can I improve my query to make it perform better?

Recommended answer

I am one of the core developers of ArangoDB and tried to optimize your query. As I do not have your dataset, I can only talk about my test dataset, and I would be happy to hear whether you can validate my results.

First of all, I am running ArangoDB 2.7, but in this particular case I do not expect a major performance difference from 2.6.

On my dataset I could execute your query as it is in ~7 sec. First fix: in your friends statement you use includeData: true but only return the _id. With includeData: false, GRAPH_NEIGHBORS returns the _id directly, so we can also get rid of the subquery here:

LET friends = GRAPH_NEIGHBORS('graph',
                              @user,
                              {"direction": "any",
                               "edgeExamples": {
                                 name: "FRIENDS_WITH"
                               }})

This got it down to ~1.1 sec on my machine, so I expect this to be close to the performance of Neo4j.

Why does this have such a high impact? Internally we first find the _id values without actually loading the documents' JSON. Your query does not need any of that data, so we can safely continue without opening the documents.

But now for the real improvement

Your query goes the "logical" way: it first gets the user's neighbors, then finds their neighbors, counts how often each foaf is found, and sorts the result. This has to build up the complete foaf network in memory and sort it as a whole.

You can also do it in a different way:
1. Find all friends of the user (only _ids)
2. Find all foaf (complete documents)
3. For each foaf, find all foaf_friends (only _ids)
4. Find the intersection of friends and foaf_friends and COUNT its elements
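To make the difference between the two strategies concrete, here is a minimal in-memory sketch in Python. It is not AQL and does not model how ArangoDB executes anything; the adjacency map `GRAPH` and both function names are hypothetical, and the sketch assumes an undirected FRIENDS_WITH graph. Like the FILTER in the original query, it excludes the user and the user's direct friends from the foaf set.

```python
from collections import Counter

# Hypothetical undirected FRIENDS_WITH graph as an adjacency map.
GRAPH = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave", "erin"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
    "erin": {"bob"},
}


def foaf_by_counting(graph, user):
    """'Logical' strategy: expand every friend, then count how often each foaf appears."""
    friends = graph[user]
    counts = Counter(
        foaf
        for friend in friends
        for foaf in graph[friend]
        if foaf != user and foaf not in friends
    )
    # Sort by count descending, then by name to keep the order deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def foaf_by_intersection(graph, user):
    """Alternative strategy: intersect the user's friends with each foaf's friends."""
    friends = graph[user]                                    # step 1
    foafs = set()
    for friend in friends:                                   # step 2
        foafs |= graph[friend]
    foafs -= friends | {user}
    result = [
        (foaf, len(friends & graph[foaf]))                   # steps 3 and 4
        for foaf in foafs
    ]
    return sorted(result, key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(foaf_by_counting(GRAPH, "alice"))
    print(foaf_by_intersection(GRAPH, "alice"))
```

Both functions return the same ranking on this toy graph; the second one only ever holds the friend id set and one foaf's id set at a time instead of the whole expanded foaf list.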

The query looks like this:

LET fids = GRAPH_NEIGHBORS("graph",
                           @user,
                           {
                             "direction":"any",
                             "edgeExamples": {
                               "name": "FRIENDS_WITH"
                              }
                           }
                          )
FOR foaf IN GRAPH_NEIGHBORS("graph",
                            @user,
                            {
                              "minDepth": 2,
                              "maxDepth": 2,
                              "direction": "any",
                              "includeData": true,
                              "edgeExamples": {
                                "name": "FRIENDS_WITH"
                              }
                            }
                           )
  LET commonIds = GRAPH_NEIGHBORS("graph",
                                  foaf._id, {
                                    "direction": "any",
                                    "edgeExamples": {
                                      "name": "FRIENDS_WITH"
                                     }
                                  }
                                 )
  LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
  SORT common_friend_count DESC
  RETURN {user: foaf, common_friend_count: common_friend_count}

In my test graph this executed in ~0.024 sec.

So this gave me a factor of 250 faster execution time, and I would expect it to be faster than your current query in Neo4j. As I do not have your dataset I cannot verify that, so it would be good if you could run it and tell me.

One last thing

The edgeExamples: {name: "FRIENDS_WITH"} option has the same cost as includeData: in this case we have to find the real edge and look into it. This can be avoided if you store your edges in separate collections based on their name and then remove edgeExamples as well. That will further increase performance (especially if there are a lot of edges).
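The idea behind separate edge collections can be sketched with a toy model in Python. This is only an illustration of why per-edge attribute checks cost extra work: the edge list `EDGES`, the `WORKS_AT` edge type, and all function names are hypothetical, and the code says nothing about ArangoDB's actual storage or indexes.

```python
# Hypothetical mixed edge list: every edge carries a "name" attribute.
EDGES = [
    {"_from": "alice", "_to": "bob", "name": "FRIENDS_WITH"},
    {"_from": "alice", "_to": "acme", "name": "WORKS_AT"},
    {"_from": "bob", "_to": "carol", "name": "FRIENDS_WITH"},
    {"_from": "carol", "_to": "acme", "name": "WORKS_AT"},
]


def neighbors_filtered(edges, vertex, name):
    """Single mixed collection: each edge has to be opened to check its name."""
    found = set()
    for edge in edges:
        if edge["name"] != name:
            continue
        if edge["_from"] == vertex:
            found.add(edge["_to"])
        elif edge["_to"] == vertex:
            found.add(edge["_from"])
    return found


def split_by_name(edges):
    """One-time split into per-name collections that keep only the endpoints."""
    collections = {}
    for edge in edges:
        collections.setdefault(edge["name"], []).append((edge["_from"], edge["_to"]))
    return collections


def neighbors_in_collection(pairs, vertex):
    """Dedicated collection: no attribute check, endpoints are enough."""
    found = set()
    for src, dst in pairs:
        if src == vertex:
            found.add(dst)
        elif dst == vertex:
            found.add(src)
    return found


if __name__ == "__main__":
    friends_with = split_by_name(EDGES)["FRIENDS_WITH"]
    print(neighbors_filtered(EDGES, "bob", "FRIENDS_WITH"))
    print(neighbors_in_collection(friends_with, "bob"))
```

Both lookups return the same neighbors, but the second never touches edges of other types and never reads an edge attribute.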

Future

Stay tuned for our next release. We are currently adding more functionality to AQL that will make your case much easier to query and should give another performance boost.
