Not sure how to create ArangoDB graph using columns in existing collection
Question
I have a rocksdb collection that contains three fields: _id, author, subreddit.
I would like to create an Arango graph connecting these two existing columns. But the examples and the drivers seem to accept only collections as edge definitions.
The ArangoDB documentation lacks information on how to create a graph using edges and nodes pulled from the same collection.
This question has been addressed through this ArangoDB issue ticket.
Recommended answer
Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.
We start arangoimp to import our base dataset:
arangoimp --create-collection true --collection RawSubReddits --type jsonl ./RC_2017-01
We use arangosh to create the collections in which our final data is going to live:
db._create("authors")
db._createEdgeCollection("authorsToSubreddits")
We fill the authors collection and simply ignore any subsequently occurring duplicate authors. We calculate the _key of each author with the MD5 function, so that it obeys the restrictions on the characters allowed in _key, and so that we can recompute it later by calling MD5() on the author field again:
db._query(`
FOR item IN RawSubReddits
INSERT {
_key: MD5(item.author),
author: item.author
} INTO authors
OPTIONS { ignoreErrors: true }`);
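To illustrate why an MD5 digest works as a key here, the following Node.js sketch hashes an author name into a hexadecimal string. It assumes Node's built-in crypto module; the helper name authorKey is hypothetical and not part of ArangoDB. It only demonstrates that the digest is deterministic and consists of hex characters, which is what makes it recomputable and safe to use as a _key.

```javascript
// Node.js sketch: derive a key-safe string from an author name using an
// MD5 hex digest. This mirrors the idea behind MD5(item.author) in the AQL
// above; it does not talk to ArangoDB.
const crypto = require("crypto");

// Hypothetical helper name, introduced only for this illustration.
function authorKey(author) {
  return crypto.createHash("md5").update(author, "utf8").digest("hex");
}

const key = authorKey("punchyourbuns");

// The digest is always 32 lowercase hex characters, regardless of input.
console.log(key.length);
console.log(/^[0-9a-f]{32}$/.test(key));

// Hashing the same name again yields the same key, so the key can be
// recomputed later instead of being looked up.
console.log(authorKey("punchyourbuns") === key);
```

Because the key is a pure function of the author name, a second insert for the same author collides on _key, which is the error that ignoreErrors: true suppresses.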
After we have filled the second vertex collection (we keep the imported collection as the first vertex collection), we have to calculate the edges. Since each author can have created several subreds, there will most probably be several edges originating from each author. As mentioned above, we can use the MD5() function again to reference the author created previously:
db._query(`
FOR onesubred IN RawSubReddits
INSERT {
_from: CONCAT('authors/', MD5(onesubred.author)),
_to: CONCAT('RawSubReddits/', onesubred._key)
} INTO authorsToSubreddits`);
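The two AQL passes above can be sketched in memory to make the resulting data model easier to see. This is only an illustration under stated assumptions: the three sample rows are made up, plain arrays and a Map stand in for the ArangoDB collections, and the md5 helper is a local convenience function.

```javascript
// In-memory sketch of the two passes: build distinct author vertices, then
// one edge per imported row. Sample data and helper names are made up.
const crypto = require("crypto");

const md5 = (s) => crypto.createHash("md5").update(s, "utf8").digest("hex");

// Stand-in for the RawSubReddits collection: one row per imported document.
const rawSubReddits = [
  { _key: "1", author: "alice", subreddit: "sewing" },
  { _key: "2", author: "bob", subreddit: "aww" },
  { _key: "3", author: "alice", subreddit: "knitting" },
];

// Pass 1: one vertex per distinct author. Skipping keys that already exist
// plays the role of the unique _key combined with ignoreErrors: true.
const authors = new Map();
for (const item of rawSubReddits) {
  const key = md5(item.author);
  if (!authors.has(key)) {
    authors.set(key, { _key: key, author: item.author });
  }
}

// Pass 2: one edge per row, from the author vertex to the row itself.
const authorsToSubreddits = rawSubReddits.map((row) => ({
  _from: "authors/" + md5(row.author),
  _to: "RawSubReddits/" + row._key,
}));

console.log(authors.size);               // 2 distinct authors
console.log(authorsToSubreddits.length); // 3 edges, one per row
```

Note that the number of edges equals the number of imported rows, while the number of author vertices equals the number of distinct author names.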
After the edge collection is filled (which may again take a while; we're talking about 40 million edges here, right?), we create the graph description:
db._graphs.save({
"_key": "reddits",
"orphanCollections" : [ ],
"edgeDefinitions" : [
{
"collection": "authorsToSubreddits",
"from": ["authors"],
"to": ["RawSubReddits"]
}
]
})
We can now use the UI to browse the graph, or use AQL queries to explore it. Let's pick a more or less random first author from that list:
db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_Eu-----_",
"author" : "punchyourbuns"
}
]
We have identified an author, and now run a graph query for him:
db._query(`FOR vertex, edge, path IN 0..1
OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
GRAPH 'reddits'
RETURN path`).toArray()
One of the resulting paths looks like this:
{
"edges" : [
{
"_key" : "128327199",
"_id" : "authorsToSubreddits/128327199",
"_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_to" : "RawSubReddits/38026350",
"_rev" : "_W_LOxgm--F"
}
],
"vertices" : [
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_HAL-y--_",
"author" : "punchyourbuns"
},
{
"_key" : "38026350",
"_id" : "RawSubReddits/38026350",
"_rev" : "_W-JS0na--b",
"distinguished" : null,
"created_utc" : 1484537478,
"id" : "dchfe6e",
"edited" : false,
"parent_id" : "t1_dch51v3",
"body" : "I don't understand tension at all. Mine is set to auto. I'll replace the needle and rethread. Thanks!",
"stickied" : false,
"gilded" : 0,
"subreddit" : "sewing",
"author" : "punchyourbuns",
"score" : 3,
"link_id" : "t3_5o66d0",
"author_flair_text" : null,
"author_flair_css_class" : null,
"controversiality" : 0,
"retrieved_on" : 1486085797,
"subreddit_id" : "t5_2sczp"
}
]
}
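The shape of the traversal result can also be reproduced with a small in-memory sketch. It assumes made-up edges and a hypothetical helper named outboundPaths, and it stores only document ids in the vertices array, whereas ArangoDB returns the full documents. The point is the depth range 0..1: depth 0 yields a path containing just the start vertex, and depth 1 yields one path per outbound edge.

```javascript
// In-memory sketch of a depth 0..1 OUTBOUND traversal over made-up edges.
const edges = [
  { _from: "authors/a1", _to: "RawSubReddits/10" },
  { _from: "authors/a1", _to: "RawSubReddits/11" },
  { _from: "authors/b2", _to: "RawSubReddits/12" },
];

// Hypothetical helper: returns one path object per visited vertex.
function outboundPaths(start, edgeList) {
  // Depth 0: the start vertex on its own, with no edges.
  const paths = [{ vertices: [start], edges: [] }];
  // Depth 1: follow every edge whose _from is the start vertex.
  for (const e of edgeList) {
    if (e._from === start) {
      paths.push({ vertices: [start, e._to], edges: [e] });
    }
  }
  return paths;
}

const paths = outboundPaths("authors/a1", edges);
console.log(paths.length); // 3: the start vertex plus two outbound edges
```

Starting the range at 1 instead of 0 would drop the first path and return only the author's documents.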