OrientDB fastest batch import


Question

I'm trying to find the fastest way to import edges into an OrientDB graph from CSV. (My OrientDB version is 2.1.15.)

My graph currently has 100k vertices and 1.5M edges. Soon it will grow to 100M vertices and 100B+ edges, and I don't want to wait months for the import to finish :)

I've tried several approaches:

  1. Default JSON ETL. The edge load rate is about 200-300 rows/sec, which is very slow: the import takes about 1.5 hours. I tried changing the "tx" mode and other properties, but it made no difference in performance.

  2. Java code using the BatchGraph class. I tried different buffer sizes for transactions; the best performance was with a size of 10. It is still slow for me: about 45 minutes.

  3. Importing a special JSON format from the console (the IMPORT DATABASE command). (By the way, it suits my task less well than the previous two.) It is also very slow: about 1 hour.

So, is there any way to import a graph like this (1.5M edges) into OrientDB in a short time? Ideally in less than 1 minute. Please tell me if I can improve my code somehow.
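To put the figures above in perspective, here is a small arithmetic sketch. It only uses numbers stated in the question (1.5M edges, 200-300 rows/sec, a 1 minute target, 100B+ planned edges); the class and method names are made up for illustration, and 250 rows/sec is simply the midpoint of the reported ETL range.

```java
final class ImportThroughput {
    /** Edges per second needed to load edgeCount edges within the given number of seconds. */
    static double requiredRate(long edgeCount, double seconds) {
        return edgeCount / seconds;
    }

    /** Seconds needed to load edgeCount edges at the given rate (edges per second). */
    static double secondsAt(long edgeCount, double edgesPerSecond) {
        return edgeCount / edgesPerSecond;
    }

    public static void main(String[] args) {
        long currentEdges = 1_500_000L;        // edges in the graph today
        long plannedEdges = 100_000_000_000L;  // lower bound of the planned "100B+" edges

        // Rate needed to hit the "less than 1 minute" target.
        double targetRate = requiredRate(currentEdges, 60.0);
        System.out.println("Target rate (edges/s): " + targetRate);

        // Midpoint of the reported 200-300 rows/sec ETL rate.
        System.out.println("Minutes at 250 rows/s: " + secondsAt(currentEdges, 250.0) / 60.0);

        // Time for the planned graph even if the target rate is reached.
        System.out.println("Days for 100B edges at target rate: "
                + secondsAt(plannedEdges, targetRate) / 86_400.0);
    }
}
```

The target works out to 25,000 edges per second, roughly 100 times the observed ETL rate. At 250 rows/sec the current file needs about 100 minutes, which is consistent with the reported 1.5 hours. Even at the target rate, 100B edges would still take about 46 days.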

My ETL JSON configuration:

{
  "source": { "file": { "path": "/opt/orientdb/orientdb-community-2.1.15/bin/csv/1_1500k_edges.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "merge": { "joinFieldName": "ids", "lookup": "V.id" } },
    { "vertex": { "class": "V" } },
    { "edge": { "class": "Edges",
                "joinFieldName": "ide",
                "lookup": "V.id",
                "direction": "out",
                "edgeFields": { "val": "${input.val}" },
                "unresolvedLinkAction": "CREATE" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/graph",
       "dbType": "graph",
       "wal": false,
       "tx": true,
       "batchCommit": 1000,
       "standardElementConstraints": false,
       "classes": [
         {"name": "V"},
         {"name": "Edges", "extends": "E"}
       ],
       "indexes": [
         {"class": "V", "fields": ["id:integer"], "type": "UNIQUE"}
       ]
    }
  }
}

Java code:

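// Imports used by this fragment (Blueprints 2.x with OrientDB 2.1):
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph;
import com.tinkerpop.blueprints.util.wrappers.batch.VertexIDType;

this.graph = new OrientGraph(this.host, this.name, this.pass);
this.graph.setStandardElementConstraints(false);
this.graph.declareIntent(new OIntentMassiveInsert());
// buff is the BatchGraph buffer size (the best result was with 10)
BatchGraph<OrientGraph> bgraph = new BatchGraph<OrientGraph>(this.graph, VertexIDType.NUMBER, buff);
bgraph.setVertexIdKey("id");
// For each CSV row: parse the two vertex ids into id[0], id[1] and the edge property into val, then:
  Vertex[] vertices = new Vertex[2];
  for (int i = 0; i < 2; i++) {
    // Reuse the vertex if it is already known, otherwise create it
    vertices[i] = bgraph.getVertex(id[i]);
    if (vertices[i] == null) vertices[i] = bgraph.addVertex(id[i]);
  }
  Edge edge = bgraph.addEdge(null, vertices[0], vertices[1], "Edges");
  edge.setProperty("val", val);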
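The parsing step is elided in the code above. A minimal sketch of it could look like the following, under several assumptions that the question does not confirm: the CSV has a header line and three comma-separated columns named ids, ide and val (taken from the field names in the ETL configuration), fields are not quoted, and the vertex ids are integers. EdgeCsv and EdgeRow are hypothetical names, and val is kept as a string because its type is not stated.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class EdgeCsv {
    /** One CSV row: id[0] is the source vertex id, id[1] the target vertex id, val the raw edge property. */
    static final class EdgeRow {
        final long[] id = new long[2];
        final String val;

        EdgeRow(long from, long to, String val) {
            this.id[0] = from;
            this.id[1] = to;
            this.val = val;
        }
    }

    /** Parses one "ids,ide,val" line; assumes no quoted fields. */
    static EdgeRow parseLine(String line) {
        String[] parts = line.split(",", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 columns: " + line);
        }
        return new EdgeRow(Long.parseLong(parts[0].trim()),
                           Long.parseLong(parts[1].trim()),
                           parts[2].trim());
    }

    /** Reads all rows, skipping the (assumed) header line and any blank lines. */
    static List<EdgeRow> readAll(BufferedReader reader) {
        List<EdgeRow> rows = new ArrayList<>();
        try {
            String line = reader.readLine(); // header line, discarded
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    rows.add(parseLine(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "ids,ide,val\n1,2,0.5\n2,3,0.7\n";
        for (EdgeRow row : readAll(new BufferedReader(new StringReader(sample)))) {
            System.out.println(row.id[0] + " -> " + row.id[1] + " val=" + row.val);
        }
    }
}
```

For a file the size described in the question, it would likely be better to handle each row as it is read (calling parseLine inside the import loop) rather than collecting every row in a list first.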

Recommended answer

I think the only way to do the import in ~1 minute is to work in plocal:

 this.graph = new OrientGraph("plocal:/physical/path/to/db/dir", this.name, this.pass);

If it's a one-shot import, you can just run it from a Java program. If it's a recurring operation and you need it to run on a stand-alone instance, you can define a server-side function to do it and expose it with a plugin:

http://orientdb.com/docs/2.0/orientdb.wiki/Extend-Server.html
