Best way to get (millions of rows of) data into JanusGraph via TinkerPop, with a specific model
Question
Just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.
- I have three datasets, each containing about 20 million rows (csv files)
- There is a specific model specifying how the variables and rows should be connected, e.g. what is a vertex, what is a label, what is an edge, etc.
- After everything is in the graph, I'd of course like to use some basic Gremlin to see the model in action.
But first I need a way to get the data into JanusGraph.
Possibly there exist scripts for this. But otherwise, is it perhaps something to be written in Python, to open a csv file, get each row of a variable X, and add it as a vertex/edge/etc.? Or am I completely misinterpreting JanusGraph/TinkerPop?
Thanks in advance for any help.
Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:
           metric_1  metric_2  metric_3  ..
person_1   a         e         i
person_2   b         f         j
person_3   c         g         k
person_4   d         h         l
..
Should I translate this into files with nodes that are, in the first place, made up of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?
And are [a, ..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping labels/categories); e.g. should each metric value be indexed separately and then linked to the given person_x to which it belongs?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Answer
JanusGraph uses pluggable storage backends and indexing backends. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console:
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
- nodes.csv: one line per node, with all its attributes
- links.csv: one line per link, with source_id and target_id plus all the link's attributes
This might require some data preparation steps.
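As one way such a preparation step could look, here is a minimal Python sketch. It assumes a wide input file with a 'person' id column plus metric columns (all file names, column names, and the person/value node model are assumptions to adapt to your own data): each person and each distinct metric value becomes its own node, and each (person, value) pair becomes a link.

```python
import csv

def split_csv(people_file, nodes_file, links_file):
    # Read the wide file: one row per person, one column per metric.
    with open(people_file, newline='') as f:
        rows = list(csv.DictReader(f))
    metrics = [c for c in rows[0] if c != 'person']

    nodes, links, seen = [], [], set()
    for row in rows:
        # One node per person.
        nodes.append({'node_id': row['person'], 'label': 'person'})
        for m in metrics:
            # One node per distinct (metric, value) pair, e.g. 'metric_1:a'.
            value_id = f'{m}:{row[m]}'
            if value_id not in seen:
                seen.add(value_id)
                nodes.append({'node_id': value_id, 'label': 'value'})
            # One link per (person, value) pair.
            links.append({'source_id': row['person'],
                          'target_id': value_id,
                          'metric': m})

    # Write nodes.csv: one line per node with all attributes.
    with open(nodes_file, 'w', newline='') as f:
        w = csv.DictWriter(f, fieldnames=['node_id', 'label'])
        w.writeheader()
        w.writerows(nodes)
    # Write links.csv: one line per link with source_id and target_id.
    with open(links_file, 'w', newline='') as f:
        w = csv.DictWriter(f, fieldnames=['source_id', 'target_id', 'metric'])
        w.writeheader()
        w.writerows(links)
```

Whether metric values should really be their own nodes (rather than plain vertex properties) depends on your model; the split into nodes.csv and links.csv stays the same either way.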
Here is an example script.
The trick to speed up the process is to keep a mapping between your own ids and the ids created by JanusGraph during the creation of the nodes.
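A minimal Groovy sketch of such a loading script (run it with the Gremlin console as above; it needs a running JanusGraph, and the properties file, column layout, and label names are assumptions to adapt):

```groovy
// Open the graph (adjust the properties file to your installation).
graph = JanusGraphFactory.open('conf/janusgraph-cql-es.properties')

// Map from your own csv ids to the vertex ids JanusGraph generates.
idMap = [:]

// nodes.csv assumed to look like: node_id,metric_1 (header on line 1).
new File('nodes.csv').eachLine { line, nb ->
    if (nb == 1) return                       // skip header
    def cols = line.split(',')
    def v = graph.addVertex(T.label, 'person', 'node_id', cols[0], 'metric_1', cols[1])
    idMap[cols[0]] = v.id()
    if (nb % 10000 == 0) graph.tx().commit()  // commit in batches
}
graph.tx().commit()

// links.csv assumed to look like: source_id,target_id (header on line 1).
g = graph.traversal()
new File('links.csv').eachLine { line, nb ->
    if (nb == 1) return
    def cols = line.split(',')
    def src = g.V(idMap[cols[0]]).next()      // cheap id lookup, no index scan
    def dst = g.V(idMap[cols[1]]).next()
    src.addEdge('linked_to', dst)
    if (nb % 10000 == 0) graph.tx().commit()
}
graph.tx().commit()
```

Looking vertices up by their JanusGraph-generated id via the in-memory map avoids millions of index lookups while the links are created, which is where most of the speedup comes from.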
Even if it is not mandatory, I strongly recommend that you create an explicit schema for your graph before loading any data. Here is an example script.