Inserting large number of nodes into Neo4J


Question

I have a table stored in a typical MySQL database, and I've built a small parser tool in Java to parse it out and build a Neo4j database. The database will have ~40 million nodes, each with one or more edges (up to a maximum of 10 edges). The problem comes from the way I have to create certain nodes: there are user nodes, comment nodes, and hashtag nodes, and the user and hashtag nodes must each be unique. I'm using code from the following example to ensure uniqueness:

public Node getOrCreateUserWithUniqueFactory( String username, GraphDatabaseService graphDb )
{
    UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory( graphDb, "users" )
    {
        @Override
        protected void initialize( Node created, Map<String, Object> properties )
        {
            created.setProperty( "name", properties.get( "name" ) );
        }
    };

    return factory.getOrCreate( "name", username );
}

I have thought about using the batch inserter, but I haven't seen a way to check whether a node is unique while performing a batch insert. So my question is: what is the fastest way to insert all these nodes while still ensuring that they retain their uniqueness? Any help would, as always, be greatly appreciated.
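As an aside on the batch-inserter part of the question: the BatchInserter API itself doesn't enforce uniqueness, so a common workaround is to keep an in-memory map from the unique key (the username) to the node id returned at creation time, and consult it before each create. Below is a minimal sketch of just that bookkeeping; the class and method names are made up for illustration, and the id counter stands in for the real `BatchInserter.createNode` call.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-memory deduplication for a batch import: remember which
// usernames already have a node, and only "create" on a map miss.
// Illustrative names; the counter stands in for batchInserter.createNode().
public class UserDedup {
    private final Map<String, Long> userIds = new HashMap<>();
    private long nextNodeId = 0;

    // Returns the node id for this username, creating the node at most once.
    public long getOrCreateUserId(String username) {
        Long existing = userIds.get(username);
        if (existing != null) {
            return existing;
        }
        long id = nextNodeId++; // would be batchInserter.createNode(properties)
        userIds.put(username, id);
        return id;
    }
}
```

At ~40 million nodes the map costs a few gigabytes of heap, which is usually acceptable for a one-off import.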

Answer

In case anyone else here runs into this problem, I want to document what a coworker and I were able to figure out in order to increase speed. First off, a note or two about the data:

  • There are a large number of users; they make up roughly 30% of the nodes
  • There are also a large number of hashtags, since people tend to hashtag almost anything
  • Both of these must be guaranteed unique

Now that that's out of the way, on to the optimizations. First and foremost, you need to ensure that your insert loop completes a transaction each time a record is inserted, rather than wrapping the entire import in one transaction. There were no real examples of this for us to look at, so initially the code looked like this (pseudo code):

Transaction begin
While(record.next()){
   parse record
   create unique user
   create unique hashtag
   create comment
   insert into graph
}
Transaction success
Transaction finish

While this worked OK and finished relatively quickly for small datasets, it didn't scale well. So we took a look at the purpose of each function and refactored the code to look like the following:

While(record.next()){
   Transaction begin

   parse record
   create unique user
   create unique hashtag
   create comment
   insert into graph

   Transaction success
   Transaction finish
}
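In the pre-2.0 embedded Java API that the `beginTx`/`success`/`finish` pseudocode suggests, the refactored loop corresponds roughly to the shape below; this is a sketch under that assumption, with the parsing and node-creation steps elided as comments, not the answer's actual code.

```java
// Sketch of the per-record transaction pattern from the pseudocode,
// assuming the old Neo4j 1.x embedded API (Transaction.success/finish).
while ( records.next() )
{
    Transaction tx = graphDb.beginTx();
    try
    {
        // parse record
        // getOrCreate the unique user and hashtag nodes via the factories
        // create the comment node and its relationships
        tx.success();
    }
    finally
    {
        tx.finish();
    }
}
```

The point of the refactor is that each iteration commits and releases its transaction state, instead of one giant transaction accumulating every pending change in memory.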

This greatly sped things up, but it wasn't enough for my coworker. So he found that Lucene indexes could be created on node attributes, and that we could reference those in the unique node factory. This gave us another significant speed boost, so much so that we could insert 1,000,000 nodes in ~10 seconds without resorting to the batch loader. Thanks to everyone for their help.
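For the Lucene index mentioned above: assuming the 1.x embedded API, a named node index can be configured explicitly through `IndexManager`, and the `"users"` index that the `UniqueNodeFactory` reads could be set up as an exact (non-fulltext) Lucene index. A sketch under that assumption, not the answer's exact code:

```java
// Sketch (Neo4j 1.x embedded API): back the "users" index used by the
// UniqueNodeFactory with an exact Lucene index, so getOrCreate( "name", ... )
// lookups are exact-match index hits rather than full scans.
IndexManager indexManager = graphDb.index();
Index<Node> users = indexManager.forNodes( "users",
        MapUtil.stringMap( IndexManager.PROVIDER, "lucene", "type", "exact" ) );
```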
