Best way to get (millions of rows of) data into JanusGraph via TinkerPop, with a specific model


Question

I have just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.

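  • I have three datasets, each containing about 20 million rows (csv files)
  • There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
  • Once everything is in a graph, I would of course like to use some basic Gremlin to see how well the model works.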

But first I need a way to get the data into JanusGraph.

Possibly scripts for this already exist. Otherwise, is it something to be written in Python: open a csv file, take each row of a variable X, and add it as a vertex/edge/etc.? Or am I completely misinterpreting JanusGraph/TinkerPop?

Thanks in advance for any help.

Say I have a few files, each of which contains a few million rows, representing people, and several variables, representing different metrics. A first example could look like this:

             metric_1    metric_2    metric_3    ..
person_1        a           e           i
person_2        b           f           j
person_3        c           g           k
person_4        d           h           l
..

Should I translate this to files with nodes that are, in the first place, made up of just the values [a,..., l] (and later perhaps more elaborate sets of properties)?

And are [a,..., l] then indexed?

The 'Modern' graph here seems to have an index (numbers 1,...,12 for all the nodes and edges, independent of their overlapping label/category). Should each measurement, for example, be indexed separately and then linked to the person_x it belongs to?

Apologies for these probably straightforward questions, but I'm fairly new to this.

Recommended answer

JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a gremlin-server, but we won't use it):

cd /path/to/janus
bin/janusgraph.sh start

Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console:

bin/gremlin.sh -e scripts/load_data.script 

An efficient way to load the data is to split it into two files:

  • nodes.csv: one line per node, with all of its attributes
  • links.csv: one line per link, with source_id, target_id and all of the link's attributes

This might require some data preparation steps.
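As a rough illustration of such a preparation step, the Python sketch below reshapes a wide table like the one in the question into the two files. It rests on modelling assumptions that are not part of the answer: each person becomes a node, each distinct metric value becomes a node labelled with its metric, and one link joins a person to each of its values. The function name `split_wide_table`, the `id`/`label` columns, the `metric:value` node ids and the `has_` link labels are all made up for this example.

```python
import csv
import io


def split_wide_table(reader, nodes_writer, links_writer, id_column="person"):
    """Stream wide rows (one person per row, one metric per column) into
    node rows and link rows. All column names here are assumptions."""
    seen = set()  # node ids already written, so each node appears once

    def emit_node(node_id, label):
        if node_id not in seen:
            seen.add(node_id)
            nodes_writer.writerow([node_id, label])

    nodes_writer.writerow(["id", "label"])
    links_writer.writerow(["source_id", "target_id", "label"])
    for row in reader:
        person_id = row[id_column]
        emit_node(person_id, "person")
        for metric, value in row.items():
            if metric == id_column or not value:
                continue  # skip the id column and empty cells
            value_id = f"{metric}:{value}"  # keeps 'a' in metric_1 distinct from 'a' in metric_2
            emit_node(value_id, metric)
            links_writer.writerow([person_id, value_id, "has_" + metric])


# Small in-memory stand-in for one of the input files.
sample = io.StringIO("person,metric_1,metric_2\nperson_1,a,e\nperson_2,b,e\n")
nodes_out, links_out = io.StringIO(), io.StringIO()
split_wide_table(csv.DictReader(sample), csv.writer(nodes_out), csv.writer(links_out))

print(nodes_out.getvalue())
print(links_out.getvalue())
```

With real data the `io.StringIO` objects would be replaced by files opened with `newline=""`. Only the set of node ids is held in memory; rows are written as they are read, which matters with tens of millions of rows.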

Here is an example script.

The trick to speed up the process is to keep a mapping between your own ids and the ids created by JanusGraph while the nodes are being created.
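The idea behind that mapping can be sketched in a few lines of Python. This is not the answer's Groovy script and it does not talk to JanusGraph: `InMemoryGraph` is a hypothetical stand-in that, like a graph database, hands out its own ids, and the `id`, `label`, `source_id` and `target_id` columns are assumed file layouts. The point it illustrates is that every link endpoint is resolved with a dictionary lookup instead of a query against the graph.

```python
import csv
import io
import itertools


class InMemoryGraph:
    """Hypothetical stand-in for a graph database that assigns its own ids.
    It is not the JanusGraph or TinkerPop API."""

    def __init__(self):
        self._next_id = itertools.count(4096)  # arbitrary starting id
        self.vertices = {}
        self.edges = []

    def add_vertex(self, label, properties):
        graph_id = next(self._next_id)
        self.vertices[graph_id] = (label, properties)
        return graph_id  # the id generated by the "database"

    def add_edge(self, source_graph_id, target_graph_id, label):
        self.edges.append((source_graph_id, target_graph_id, label))


def load(graph, nodes_file, links_file):
    """Load nodes first, remembering own id -> generated id, then load links."""
    id_map = {}
    for row in csv.DictReader(nodes_file):
        id_map[row["id"]] = graph.add_vertex(row["label"], {"name": row["id"]})
    for row in csv.DictReader(links_file):
        # Both endpoints come from the dictionary, with no lookup in the graph.
        graph.add_edge(id_map[row["source_id"]], id_map[row["target_id"]], row["label"])
    return id_map


nodes_csv = io.StringIO("id,label\nperson_1,person\nmetric_1:a,metric_1\n")
links_csv = io.StringIO("source_id,target_id,label\nperson_1,metric_1:a,has_metric_1\n")
graph = InMemoryGraph()
id_map = load(graph, nodes_csv, links_csv)
print(id_map)
print(graph.edges)
```

The trade-off is memory: the dictionary holds one entry per node, so for very large node files it may need to be partitioned or kept in an on-disk store.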

Although it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script.
