Not sure how to create ArangoDB graph using columns in existing collection


Question

I have a RocksDB collection that contains three fields: _id, author, subreddit.

I would like to create an ArangoDB graph that connects two of these existing columns (author and subreddit). However, the examples and the drivers seem to accept only collections in their edge definitions.

The ArangoDB documentation lacks information on how to create a graph whose edges and nodes are pulled from the same collection.

This question was resolved through an ArangoDB issue ticket.

Recommended answer

Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.

We run arangoimp (named arangoimport in newer ArangoDB releases) to import our base dataset:

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections in which our final data will live:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

We fill the authors collection, simply ignoring any subsequently occurring duplicate authors. We calculate the _key of each author with the MD5 function, so that it obeys the restrictions on allowed characters in _key and so that we can derive it again later by calling MD5() on the author field:

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
    } INTO authors
    OPTIONS { ignoreErrors: true }`);
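To illustrate why an MD5 digest works as a _key, here is a minimal sketch in plain Node.js (not arangosh). It assumes that AQL's MD5() returns the lowercase hexadecimal digest of the string, as Node's crypto module does; the helper name authorKey is hypothetical and not part of ArangoDB.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive a document key from an author name,
// mirroring MD5(item.author) in the AQL query above.
function authorKey(author) {
  return crypto.createHash("md5").update(author, "utf8").digest("hex");
}

// The digest is always 32 hexadecimal characters, whatever the input contains,
// and the same author name always yields the same key.
const key = authorKey("some author/with spaces");
console.log(key.length, /^[0-9a-f]{32}$/.test(key));
console.log(authorKey("hello") === authorKey("hello"));
```

Because the key is a pure function of the author name, a second INSERT for the same author collides on _key, which is what lets ignoreErrors drop the duplicates.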

After we have filled the second vertex collection (we keep the imported collection as the first vertex collection), we have to calculate the edges. Since each author can have created several subreds (documents in RawSubReddits), there will most probably be several edges originating from each author. As previously mentioned, we can use the MD5() function again to reference the author created earlier:

db._query(`
  FOR onesubred IN RawSubReddits
    INSERT {
      _from: CONCAT('authors/', MD5(onesubred.author)),
      _to: CONCAT('RawSubReddits/', onesubred._key)
    } INTO authorsToSubreddits`);
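As a rough illustration of what the edge query produces for one imported document, the following plain Node.js sketch builds the same _from and _to values. The function name edgeFor and the sample document are made up for illustration, and the sketch again assumes that AQL's MD5() matches Node's hexadecimal MD5 digest.

```javascript
const crypto = require("crypto");

// Hypothetical helper mirroring the INSERT above for a single imported document.
function edgeFor(doc) {
  const authorKey = crypto.createHash("md5").update(doc.author, "utf8").digest("hex");
  return {
    _from: "authors/" + authorKey,     // vertex created from the author field
    _to: "RawSubReddits/" + doc._key,  // the imported document itself
  };
}

// Sample document with made-up values, for illustration only.
const sample = { _key: "1001", author: "hello", subreddit: "sewing" };
console.log(edgeFor(sample));
```

Each imported document yields exactly one edge, so the edge collection ends up with as many documents as RawSubReddits.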

After the edge collection is filled (which may again take a while; we're talking about 40 million edges here), we create the graph description:

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})
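As an alternative to writing into the _graphs collection directly, the same definition can presumably be registered through the general-graph module that ships with arangosh. The sketch below has not been run against this dataset; createRedditsGraph is a hypothetical wrapper, and the module is passed in as a parameter so the function does not depend on where it is loaded. Use either this or the db._graphs.save call above, not both, since the graph name would already be taken.

```javascript
// `graphModule` is expected to be require("@arangodb/general-graph") inside arangosh.
function createRedditsGraph(graphModule) {
  // One relation: edges in authorsToSubreddits go from authors to RawSubReddits.
  const relation = graphModule._relation(
    "authorsToSubreddits", // edge collection
    ["authors"],           // allowed _from collections
    ["RawSubReddits"]      // allowed _to collections
  );
  return graphModule._create("reddits", [relation]);
}

// In arangosh:
// createRedditsGraph(require("@arangodb/general-graph"));
```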

We can now use the UI to browse the graph, or explore it with AQL queries. Let's pick an arbitrary first author from the authors collection:

db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

Having identified an author, we now run a graph query for that author:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all. Mine is set to auto. I'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}
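Once the paths have been fetched with toArray(), they are ordinary JavaScript objects and can be post-processed. As a small sketch, the hypothetical helper subredditsFromPaths below collects the distinct subreddit names an author has written in; the sample data is a trimmed-down copy of the result above.

```javascript
// Hypothetical helper: collect the distinct subreddit names reached by the paths.
function subredditsFromPaths(paths) {
  const names = new Set();
  for (const path of paths) {
    for (const vertex of path.vertices) {
      // Author vertices have no `subreddit` field, so they are skipped.
      if (typeof vertex.subreddit === "string") names.add(vertex.subreddit);
    }
  }
  return [...names].sort();
}

// Trimmed-down sample shaped like the traversal result above.
// With a depth of 0..1 the first path contains only the start vertex.
const samplePaths = [
  {
    edges: [],
    vertices: [
      { _id: "authors/1cec812d4e44b95e5a11f3cbb15f7980", author: "punchyourbuns" },
    ],
  },
  {
    edges: [
      {
        _from: "authors/1cec812d4e44b95e5a11f3cbb15f7980",
        _to: "RawSubReddits/38026350",
      },
    ],
    vertices: [
      { _id: "authors/1cec812d4e44b95e5a11f3cbb15f7980", author: "punchyourbuns" },
      { _id: "RawSubReddits/38026350", author: "punchyourbuns", subreddit: "sewing" },
    ],
  },
];

console.log(subredditsFromPaths(samplePaths));
```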
