What is the fastest ArangoDB friends-of-friends query (with count)?


Question

I'm trying to use ArangoDB to get a list of friends-of-friends. Beyond a basic friends-of-friends list, I also want to know how many friends the user and each friend-of-a-friend have in common, and to sort the result by that count. After several attempts at (re)writing the best-performing AQL query, this is what I ended up with:

LET friends = (
  FOR f IN GRAPH_NEIGHBORS('graph', @user, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
  RETURN f._id
)

LET foafs = (FOR friend IN friends
  FOR foaf in GRAPH_NEIGHBORS('graph', friend, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
    FILTER foaf._id != @user AND foaf._id NOT IN friends
    COLLECT foaf_result = foaf WITH COUNT INTO common_friend_count
    RETURN {
      user: foaf_result,
      common_friend_count: common_friend_count
    }
)
FOR foaf IN foafs
  SORT foaf.common_friend_count DESC
  RETURN foaf

Unfortunately, performance is not as good as I would have liked. Compared to the Neo4j version of the same query (on the same data), the AQL query seems quite a bit slower (5-10x).

What I'd like to know is: how can I improve our query to make it perform better?

Recommended answer

I am one of the core developers of ArangoDB and have tried to optimize your query. As I do not have your dataset, I can only talk about my test dataset, and I would be happy to hear whether you can validate my results.

First of all, I am running ArangoDB 2.7, but in this particular case I do not expect a major performance difference from 2.6.

On my dataset I could execute your query as it is in ~7 sec. First fix: in your friends statement you use includeData: true and then return only the _id. With includeData: false, GRAPH_NEIGHBORS returns the _id directly, so we can also get rid of the subquery here:

LET friends = GRAPH_NEIGHBORS('graph',
                              @user,
                              {"direction": "any",
                               "edgeExamples": {
                                   name: "FRIENDS_WITH"
                               }})

This got it down to ~1.1 sec on my machine, so I expect it to be close to the performance of Neo4j.

Why does this have such a high impact? Internally, we first find the _id values without actually loading the documents' JSON. Your query does not need any of that data, so we can safely continue without opening it.

But now for the real improvement

Your query goes the "logical" way: it first gets the user's neighbors, then finds their neighbors, counts how often each foaf is found, and sorts the result. This has to build up the complete foaf network in memory and sort it as a whole.
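As a rough model of that "logical" approach, here is a minimal Python sketch that works on a plain in-memory adjacency map instead of a database. The graph contents, the vertex ids, and the function name foaf_by_grouping are all made up for illustration; the sketch only mirrors the expand, filter, group, and sort steps and says nothing about how ArangoDB executes them internally.

```python
from collections import Counter

# Hypothetical in-memory friendship graph: vertex id -> set of neighbour ids.
# The ids and edges are invented for this example.
GRAPH = {
    "users/alice": {"users/bob", "users/carol"},
    "users/bob": {"users/alice", "users/dave", "users/erin"},
    "users/carol": {"users/alice", "users/dave"},
    "users/dave": {"users/bob", "users/carol"},
    "users/erin": {"users/bob"},
}


def foaf_by_grouping(graph, user):
    """Expand every friend, then group and count the friends-of-friends."""
    friends = graph[user]
    counts = Counter()
    for friend in friends:
        for foaf in graph[friend]:
            # Mirrors the FILTER in the query: skip the user and direct friends.
            if foaf != user and foaf not in friends:
                counts[foaf] += 1  # one hit per friend shared with `user`
    # Sort by count, highest first; ties are broken by id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


result = foaf_by_grouping(GRAPH, "users/alice")
print(result)
```

Every (friend, foaf) pair is visited before any count is final, which is the in-memory build-up described above.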

You can also do it in a different way:

1. Find all friends of the user (only _ids).
2. Find all foafs (complete documents).
3. For each foaf, find all foaf_friends (only _ids).
4. Find the intersection of friends and foaf_friends and COUNT them.

The query looks like this:

LET fids = GRAPH_NEIGHBORS("graph",
                           @user,
                           {
                             "direction":"any",
                             "edgeExamples": {
                               "name": "FRIENDS_WITH"
                              }
                           }
                          )
FOR foaf IN GRAPH_NEIGHBORS("graph",
                            @user,
                            {
                              "minDepth": 2,
                              "maxDepth": 2,
                              "direction": "any",
                              "includeData": true,
                              "edgeExamples": {
                                "name": "FRIENDS_WITH"
                              }
                            }
                           )
  LET commonIds = GRAPH_NEIGHBORS("graph",
                                  foaf._id, {
                                    "direction": "any",
                                    "edgeExamples": {
                                      "name": "FRIENDS_WITH"
                                     }
                                  }
                                 )
  LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
  SORT common_friend_count DESC
  RETURN {user: foaf, common_friend_count: common_friend_count}

In my test graph this executed in ~0.024 sec.

So this made execution faster by a factor of 250, and I would expect it to be faster than your current query in Neo4j. As I do not have your dataset I cannot verify that, so it would be good if you could try it and tell me.
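To make the intersection-based steps concrete, here is a small Python sketch of the same idea on an in-memory adjacency map. The graph data and the function name foaf_by_intersection are hypothetical. One assumption to flag loudly: the sketch removes the user and the user's direct friends from the depth-2 set, as the FILTER in the original query does; it does not claim that the depth-2 GRAPH_NEIGHBORS call behaves the same way.

```python
# Hypothetical in-memory friendship graph: vertex id -> set of neighbour ids.
# The ids and edges are invented for this example.
GRAPH = {
    "users/alice": {"users/bob", "users/carol"},
    "users/bob": {"users/alice", "users/dave", "users/erin"},
    "users/carol": {"users/alice", "users/dave"},
    "users/dave": {"users/bob", "users/carol"},
    "users/erin": {"users/bob"},
}


def foaf_by_intersection(graph, user):
    """Count common friends by intersecting id sets, one foaf at a time."""
    # Step 1: ids of the user's friends.
    friends = graph[user]

    # Step 2: vertices two hops away.
    # Assumption: the user and direct friends are excluded here.
    foafs = set()
    for friend in friends:
        foafs |= graph[friend]
    foafs -= friends | {user}

    rows = []
    for foaf in foafs:
        # Step 3: ids of this foaf's own friends.
        foaf_friends = graph[foaf]
        # Step 4: size of the intersection is the common friend count.
        rows.append((foaf, len(friends & foaf_friends)))

    # Sort by count, highest first; ties are broken by id for a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


result = foaf_by_intersection(GRAPH, "users/alice")
print(result)
```

Each count depends only on two id sets, so nothing has to be grouped across the whole result before a row is complete.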

One last thing

The edgeExamples: {name: "FRIENDS_WITH"} filter has the same cost as includeData: in this case we have to find the actual edge and look into it. This can be avoided if you store your edges in separate collections based on their name and then remove the edgeExamples as well. That will further increase performance (especially if there are a lot of edges).
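The difference can be illustrated with a toy Python model. It is only a sketch: the edge data, the WORKS_WITH edge type, and the helper names are invented, and it says nothing about how ArangoDB actually stores or indexes edges. It contrasts a lookup that has to open each edge to check its name with a lookup on edges that were already split by name.

```python
# Hypothetical mixed edge list: (source, target, attributes).
MIXED_EDGES = [
    ("users/alice", "users/bob", {"name": "FRIENDS_WITH"}),
    ("users/alice", "users/carol", {"name": "WORKS_WITH"}),
    ("users/bob", "users/dave", {"name": "FRIENDS_WITH"}),
]


def neighbors_filtered(edges, vertex, name):
    """Find neighbours in a mixed list; every edge's attributes are inspected."""
    found = set()
    for source, target, attributes in edges:
        if attributes["name"] != name:  # the per-edge look-up being paid for
            continue
        if source == vertex:
            found.add(target)
        elif target == vertex:
            found.add(source)
    return found


def split_by_name(edges):
    """Group edges into one list per name, dropping the attribute payload."""
    collections = {}
    for source, target, attributes in edges:
        collections.setdefault(attributes["name"], []).append((source, target))
    return collections


def neighbors_unfiltered(pairs, vertex):
    """Find neighbours in a single-purpose list; no attribute check needed."""
    found = set()
    for source, target in pairs:
        if source == vertex:
            found.add(target)
        elif target == vertex:
            found.add(source)
    return found


collections = split_by_name(MIXED_EDGES)
filtered = neighbors_filtered(MIXED_EDGES, "users/alice", "FRIENDS_WITH")
unfiltered = neighbors_unfiltered(collections["FRIENDS_WITH"], "users/alice")
print(filtered, unfiltered)
```

Both lookups return the same neighbours; the second one simply never touches edge attributes, which is the work the separate collections would save.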

Future

Stay tuned for our next release: we are currently adding more functionality to AQL that will make your case much easier to query and should give another performance boost.
