Not sure how to create ArangoDB graph using columns in existing collection


Question

I have a RocksDB collection that contains three fields: _id, author, subreddit.

I would like to create an ArangoDB graph that connects these two existing columns. But the examples and the drivers seem to accept only collections as edge definitions.

The ArangoDB documentation lacks information on how to create a graph using edges and nodes pulled from the same collection.

A ticket for this issue has been filed with ArangoDB.

Recommended answer

Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.

We start arangoimp to import our base dataset:

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections where our final data will live:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

We fill the authors collection, simply ignoring any subsequently occurring duplicate authors. We calculate the _key of each author with the MD5 function, so that it obeys the restrictions on characters allowed in _key and so that we can recompute it later by calling MD5() again on the author field:

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
      } INTO authors
        OPTIONS { ignoreErrors: true }`);
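
The key derivation can be tried outside the database. The following is a minimal sketch in plain Node.js, not part of the original answer. It assumes that AQL's MD5() returns the lowercase hex digest of the string; the helper names authorKey and uniqueAuthors and the sample records are hypothetical. It shows that the hash is deterministic and contains only hex characters, so repeated authors collapse onto one document:

```javascript
const crypto = require('crypto');

// Hypothetical helper: lowercase hex MD5 digest of the author name,
// assumed to match what AQL's MD5() produces for the same string.
function authorKey(author) {
  return crypto.createHash('md5').update(author, 'utf8').digest('hex');
}

// Hypothetical helper: keep the first record per key and skip later ones,
// mimicking INSERT ... OPTIONS { ignoreErrors: true } on a unique _key.
function uniqueAuthors(records) {
  const byKey = new Map();
  for (const { author } of records) {
    const _key = authorKey(author);
    if (!byKey.has(_key)) byKey.set(_key, { _key, author });
  }
  return [...byKey.values()];
}

// Made-up sample data: 'alice' appears twice but is stored once.
const sample = [{ author: 'alice' }, { author: 'bob' }, { author: 'alice' }];
console.log(uniqueAuthors(sample).length); // 2
```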

After filling the second vertex collection (we keep the imported collection as the first vertex collection), we have to calculate the edges. Since each author may have created several subreddit entries, there will most probably be several edges originating from each author. As mentioned previously, we can use the MD5() function again to reference the author created earlier:

db._query(`
  FOR onesubred IN RawSubReddits
    INSERT {
      _from: CONCAT('authors/', MD5(onesubred.author)),
      _to: CONCAT('RawSubReddits/', onesubred._key)
    } INTO authorsToSubreddits`);
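
How each edge is derived from one raw record can be sketched in plain Node.js. This is an illustration rather than the author's code: buildEdge is a hypothetical helper, the sample record reuses the key and author name that appear in the sample path later in this answer, and it again assumes that AQL's MD5() yields the lowercase hex digest:

```javascript
const crypto = require('crypto');

// Assumed equivalent of AQL's MD5(): lowercase hex digest of the string.
const md5 = (s) => crypto.createHash('md5').update(s, 'utf8').digest('hex');

// Hypothetical helper: an edge points from the author vertex to the raw record.
function buildEdge(record) {
  return {
    _from: 'authors/' + md5(record.author),
    _to: 'RawSubReddits/' + record._key,
  };
}

const edge = buildEdge({ _key: '38026350', author: 'punchyourbuns' });
console.log(edge._to); // RawSubReddits/38026350
```

Because the author key is recomputed from the author field, no lookup in the authors collection is needed while building edges.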

After the edge collection is filled (which may again take a while; we're talking about 40 million edges here, right?), we create the graph description:

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})
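
An edge definition declares which vertex collections an edge collection may connect. As a rough illustration in plain Node.js (the helper edgeMatchesDefinition is hypothetical and not an ArangoDB API), an edge conforms to the definition above when the collection parts of its _from and _to appear in the from and to lists:

```javascript
// The edge definition from the graph description, as a plain object.
const definition = {
  collection: 'authorsToSubreddits',
  from: ['authors'],
  to: ['RawSubReddits'],
};

// Document ids have the form "<collection>/<key>".
const collectionOf = (id) => id.split('/')[0];

// Hypothetical check, not an ArangoDB API.
function edgeMatchesDefinition(edge, def) {
  return def.from.includes(collectionOf(edge._from)) &&
         def.to.includes(collectionOf(edge._to));
}

const edge = {
  _from: 'authors/1cec812d4e44b95e5a11f3cbb15f7980',
  _to: 'RawSubReddits/38026350',
};
console.log(edgeMatchesDefinition(edge, definition)); // true
```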

We can now use the UI to browse the graph, or use AQL queries to explore it. Let's pick the first author from that list, which is more or less random:

db._query(`FOR author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

We have identified an author, and now run a graph query for them:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()
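
To make the 0..1 depth range concrete, here is a toy in-memory sketch in plain Node.js. It is not how ArangoDB evaluates traversals; outboundPaths is a hypothetical helper and the sample vertices and edges are made up. Depth 0 yields a path containing only the start vertex, and depth 1 yields one path per outgoing edge:

```javascript
// Toy model: returns paths of depth 0..1 from a start vertex id,
// given vertices keyed by _id and a flat list of edges.
function outboundPaths(startId, vertices, edges) {
  const start = vertices[startId];
  // Depth 0: the start vertex alone, with no edges.
  const paths = [{ vertices: [start], edges: [] }];
  // Depth 1: follow every edge whose _from is the start vertex.
  for (const edge of edges) {
    if (edge._from === startId) {
      paths.push({ vertices: [start, vertices[edge._to]], edges: [edge] });
    }
  }
  return paths;
}

// Made-up sample data: one author with two outgoing edges.
const vertices = {
  'authors/a1': { _id: 'authors/a1', author: 'alice' },
  'RawSubReddits/1': { _id: 'RawSubReddits/1', subreddit: 'sewing' },
  'RawSubReddits/2': { _id: 'RawSubReddits/2', subreddit: 'knitting' },
};
const edges = [
  { _from: 'authors/a1', _to: 'RawSubReddits/1' },
  { _from: 'authors/a1', _to: 'RawSubReddits/2' },
];
console.log(outboundPaths('authors/a1', vertices, edges).length); // 3
```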

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all."
         "Mine is set to auto."
         "I'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}
