Inserting large number of nodes into Neo4J

Problem description

I have a table stored in a typical MySQL database, and I've built a small parser tool in Java to parse it out and build a Neo4j database. This database will have ~40 million nodes, each with one or more edges (up to a maximum of about 10). The problem comes from the way I have to create certain nodes. There are user nodes, comment nodes, and hashtag nodes. The user nodes and hashtag nodes must each be unique. I'm using code based on the following example to ensure uniqueness:

import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.UniqueFactory;

public Node getOrCreateUserWithUniqueFactory( String username, GraphDatabaseService graphDb )
{
    // Backed by the index named "users"; getOrCreate() is atomic, so
    // concurrent callers cannot create duplicate user nodes.
    UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory( graphDb, "users" )
    {
        @Override
        protected void initialize( Node created, Map<String, Object> properties )
        {
            // Runs only when the node did not already exist and is being created
            created.setProperty( "name", properties.get( "name" ) );
        }
    };

    return factory.getOrCreate( "name", username );
}
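
For context, calling this helper has to happen inside a transaction. A minimal usage sketch, assuming Neo4j 2.x (where Transaction is AutoCloseable and the embedded factory takes a path string); the store path is just a placeholder:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// "data/graph.db" is a placeholder path for an embedded store
GraphDatabaseService graphDb = new GraphDatabaseFactory().newEmbeddedDatabase( "data/graph.db" );

try ( Transaction tx = graphDb.beginTx() )
{
    // Returns the existing "alice" node, or atomically creates it
    Node alice = getOrCreateUserWithUniqueFactory( "alice", graphDb );
    tx.success();
}

graphDb.shutdown();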

I have thought about using the batch inserter, but I haven't seen a way to check whether a node is unique while performing a batch insert. So my question is: what is the fastest way to insert all these nodes while still ensuring that they remain unique? Any help would, as always, be greatly appreciated.
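
For what it's worth, the usual workaround is to enforce uniqueness yourself with an in-memory map while batch inserting, since the BatchInserter bypasses transactions and index lookups. A self-contained sketch of that idea (not from the original question; the store path is a placeholder, and the exact BatchInserters.inserter(...) signature varies across Neo4j versions):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchDedupSketch
{
    // username -> node id, so each user is created exactly once;
    // ~40M entries is feasible in memory if the heap is sized for it
    private static final Map<String, Long> userIds = new HashMap<>();

    public static void main( String[] args ) throws Exception
    {
        BatchInserter inserter = BatchInserters.inserter( new File( "data/graph.db" ) );
        try
        {
            long alice = getOrCreateUser( inserter, "alice" );
            long comment = inserter.createNode( props( "text", "hello world" ) );
            inserter.createRelationship( comment, alice,
                    DynamicRelationshipType.withName( "POSTED_BY" ), null );
        }
        finally
        {
            inserter.shutdown();
        }
    }

    private static long getOrCreateUser( BatchInserter inserter, String name )
    {
        Long id = userIds.get( name );
        if ( id == null )
        {
            id = inserter.createNode( props( "name", name ) );
            userIds.put( name, id );
        }
        return id;
    }

    private static Map<String, Object> props( String key, Object value )
    {
        Map<String, Object> m = new HashMap<>();
        m.put( key, value );
        return m;
    }
}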

Recommended answer

In case anyone else runs into this problem, I want to document what a coworker and I were able to figure out to increase speed. First off, a note or two about the data:


  • There are a huge number of users, making up roughly 30% of the nodes

  • There are also a large number of hashtags, since people tend to hashtag just about anything

  • Both of these must be guaranteed unique

Now that that's out of the way, on to the optimizations. First and foremost, you need to ensure that a transaction completes each time a record is inserted. There were no real examples of this for us to look at, so initially the code looked like this (pseudocode):

Transaction begin
While(record.next()){
   parse record
   create unique user
   create unique hashtag
   create comment
   insert into graph
}
Transaction success
Transaction finish

While this worked OK and finished relatively quickly for small datasets, it didn't scale well. So we took a look at what each step was doing and refactored the code to look like the following:

While(record.next()){
   Transaction begin

   parse record
   create unique user
   create unique hashtag
   create comment
   insert into graph

   Transaction success
   Transaction finish
}
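
In embedded-API terms, that refactored loop corresponds to something like the sketch below. It assumes Neo4j 2.x (Transaction is AutoCloseable), a JDBC ResultSet as the record source, and a hypothetical getOrCreateHashtagWithUniqueFactory helper analogous to the user one from the question; the column and relationship names are illustrative:

import java.sql.ResultSet;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public void importRecords( ResultSet records, GraphDatabaseService graphDb ) throws Exception
{
    while ( records.next() )
    {
        // One small transaction per record instead of one giant one
        try ( Transaction tx = graphDb.beginTx() )
        {
            Node user = getOrCreateUserWithUniqueFactory( records.getString( "username" ), graphDb );
            Node hashtag = getOrCreateHashtagWithUniqueFactory( records.getString( "hashtag" ), graphDb );

            // Comments need not be unique, so a plain node is enough
            Node comment = graphDb.createNode();
            comment.setProperty( "text", records.getString( "comment" ) );

            comment.createRelationshipTo( user, DynamicRelationshipType.withName( "POSTED_BY" ) );
            comment.createRelationshipTo( hashtag, DynamicRelationshipType.withName( "TAGGED" ) );

            tx.success();
        }
    }
}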

This greatly sped things up, but it wasn't enough for my coworker. He then found that Lucene indexes could be created on node properties, and that we could reference those in the UniqueNodeFactory. This gave us another significant speed boost: so much so that we could insert 1,000,000 nodes in ~10 seconds without resorting to the batch loader. Thanks to everyone for their help.
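
The answer doesn't show the index setup itself. With the legacy index API that UniqueNodeFactory builds on, pre-configuring exact-match (Lucene-backed) indexes could look roughly like the snippet below, reusing the graphDb handle from earlier; the configuration map and the "hashtags" index name are assumptions, not code from the answer:

import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;

// Sketch: pre-create exact-match Lucene indexes under the names the
// unique factories use ("users" from the question, "hashtags" assumed)
try ( Transaction tx = graphDb.beginTx() )
{
    graphDb.index().forNodes( "users",
            MapUtil.stringMap( IndexManager.PROVIDER, "lucene", "type", "exact" ) );
    graphDb.index().forNodes( "hashtags",
            MapUtil.stringMap( IndexManager.PROVIDER, "lucene", "type", "exact" ) );
    tx.success();
}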
