Importing a large xml file to Neo4j with Py2neo

Problem description

I have a problem importing a very large XML file with 36,196,662 lines. I am trying to build a Neo4j graph database from this XML file with Py2neo. My XML file looks like this:

http://imgur.com/pLylHeG

My Python code to import the XML data into Neo4j looks like this:

import codecs
from xml.dom import minidom

from py2neo import Graph, Node, Relationship, authenticate
from py2neo.packages.httpstream import http

# Raise the HTTP socket timeout so long-running requests are not cut off
http.socket_timeout = 9999

authenticate("localhost:7474", "neo4j", "******")

graph = Graph("http://localhost:7474/db/data/")

# Read the whole file as latin-1 and re-encode it as UTF-8 for the parser
xml_file = codecs.open("User_profilesL2T1.xml","r", encoding="latin-1")

# minidom builds the complete DOM tree in memory
xml_doc = minidom.parseString (codecs.encode (xml_file.read(), "utf-8"))

#xml_doc = minidom.parse(xml_file)
persons = xml_doc.getElementsByTagName('user')
label1 = "USER"

# Adding Nodes
for person in persons:


    if person.getElementsByTagName("id")[0].firstChild:
       Id_User=person.getElementsByTagName("id")[0].firstChild.data
    else: 
       Id_User="NO ID"
    print ("******************************USER***************************************")
    print(Id_User)



    print ("*************************")
    if person.getElementsByTagName("name")[0].firstChild:
       Name=person.getElementsByTagName("name")[0].firstChild.data
    else: 
       Name="NO NAME"   
   # print("Name :",Name)


    print ("*************************")
    if person.getElementsByTagName("screen_name")[0].firstChild:
       Screen_name=person.getElementsByTagName("screen_name")[0].firstChild.data
    else: 
       Screen_name="NO SCREEN_NAME" 
  #   print("Screen Name :",Screen_name)

    print ("*************************") 
    if person.getElementsByTagName("location")[0].firstChild:
       Location=person.getElementsByTagName("location")[0].firstChild.data
    else: 
       Location="NO Location"   
 #    print("Location :",Location)


    print ("*************************")
    if person.getElementsByTagName("description")[0].firstChild:
       Description=person.getElementsByTagName("description")[0].firstChild.data
    else: 
       Description="NO description" 
  #   print("Description :",Description)


    print ("*************************") 
    if person.getElementsByTagName("profile_image_url")[0].firstChild:
       Profile_image_url=person.getElementsByTagName("profile_image_url")[0].firstChild.data
    else: 
       Profile_image_url="NO profile_image_url" 
   # print("Profile_image_url :",Profile_image_url)

    print ("*************************")
    if person.getElementsByTagName("friends_count")[0].firstChild:
       Friends_count=person.getElementsByTagName("friends_count")[0].firstChild.data
    else: 
       Friends_count="NO friends_count" 
 #    print("Friends_count :",Friends_count)


    print ("*************************")
    if person.getElementsByTagName("url")[0].firstChild:
       URL=person.getElementsByTagName("url")[0].firstChild.data
    else: 
       URL="NO URL" 
  #   print("URL :",URL)






    node1 = Node(label1,ID_USER=Id_User,NAME=Name,SCREEN_NAME=Screen_name,LOCATION=Location,DESCRIPTION=Description,Profile_Image_Url=Profile_image_url,Friends_Count=Friends_count,URL=URL)
    graph.merge(node1)  

My problem is that when I run the code, it takes almost a week to import this file. If anyone can help me import the data faster than that, I would be very grateful.

NB: My laptop configuration is: 4 GB RAM, 500 GB hard disk, i5

Solution

I think you should use a streaming parser; otherwise you may run out of memory on the Python side before Neo4j is even involved.
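As a rough illustration of the streaming idea, here is a minimal sketch using `iterparse` from the standard library's `xml.etree.ElementTree`. The element layout (a root containing `<user>` elements with flat child tags) is an assumption taken from the tag names in the question's code, since the actual file is only shown as an image. The helper name `iter_users` is made up for this example, and the sketch assumes the file is UTF-8 or declares its encoding.

```python
import io
import xml.etree.ElementTree as ET


def iter_users(source, tag="user"):
    """Yield one dict per <user> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it
    # so that already-processed children can be dropped from memory.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Map each child tag to its text (None for empty elements).
            yield {child.tag: child.text for child in elem}
            # Runs when the generator resumes: free the finished elements.
            root.clear()


# Small in-memory sample standing in for the real file (hypothetical data).
sample = io.BytesIO(
    b"<users>"
    b"<user><id>1</id><name>Ann</name><location/></user>"
    b"<user><id>2</id><name>Bob</name><location>Paris</location></user>"
    b"</users>"
)
users = list(iter_users(sample))
```

With a real file you would pass the file name or a file opened in binary mode instead of `sample`, and consume the generator lazily rather than calling `list()` on it.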

Also, I recommend batching your writes to Neo4j, with 10k to 100k updates per transaction.
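One way to form such batches, independent of any driver, is a small generator that groups an iterable into fixed-size lists. This is only a sketch: the name `batched` and the default size of 10000 are choices made for the example, and each yielded list would then be written in a single transaction.

```python
from itertools import islice


def batched(rows, size=10000):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 25,000 dummy rows split into batches of 10,000.
sizes = [len(batch) for batch in batched(range(25000), size=10000)]
```

Because it works on any iterable, it can sit directly on top of a streaming parser, so only one batch is held in memory at a time.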

Don't store "NO xxxx" fields; just leave them off. They are a waste of space and effort.
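For instance, a row could be built so that missing or empty fields simply never become properties. In this sketch the helper `clean_row` and the mapping `KEY_MAP` are hypothetical; only `userId` comes from the query suggested in this answer, and the other property names are placeholders.

```python
def clean_row(raw, key_map):
    """Build a property dict, skipping fields that are missing or blank."""
    row = {}
    for source_key, prop in key_map.items():
        value = raw.get(source_key)
        if value is not None and value.strip():
            row[prop] = value.strip()
    return row


# Hypothetical mapping from XML tag names to node property names.
KEY_MAP = {
    "id": "userId",
    "name": "name",
    "screen_name": "screenName",
    "location": "location",
}

row = clean_row(
    {"id": "42", "name": "Ann", "screen_name": "  ", "location": None},
    KEY_MAP,
)
```

Here `screen_name` and `location` are dropped instead of being stored as "NO SCREEN_NAME" or "NO Location".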

I don't know how merge(node) works. I recommend creating a unique constraint on :User(userId) and using a Cypher query like this:

UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row

where the {data} parameter is a list (e.g. 10k entries) of dictionaries holding the properties.
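To make the shape of the `data` parameter concrete, here is a hedged sketch of how the batches might be handed to the query. It deliberately does not call py2neo directly, because the call for running parameterised Cypher differs between py2neo versions; `run_query` is a hypothetical callable that you would implement with whatever your driver version provides. The query writes `row` without braces, since `row` is the variable introduced by `UNWIND` rather than a parameter. Both the `{data}` parameter style and the `CREATE CONSTRAINT ... ASSERT` form are the older Cypher syntax matching the Neo4j version in the question; newer releases use `$data` and a different constraint syntax.

```python
# Older-style Cypher, assumed to match the Neo4j version in the question.
# Run the constraint once before importing so MERGE can use the index.
CONSTRAINT = "CREATE CONSTRAINT ON (u:User) ASSERT u.userId IS UNIQUE"

QUERY = """
UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row
"""


def import_rows(rows, run_query, batch_size=10000):
    """Send `rows` (a list of property dicts) to Neo4j one batch per call.

    `run_query(query, params)` is a hypothetical hook wrapping the driver.
    """
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        run_query(QUERY, {"data": batch})
        sent += len(batch)
    return sent


# Stand-in runner that records the calls instead of talking to a database.
calls = []


def fake_run(query, params):
    calls.append((query, params))


rows = [{"userId": str(i), "name": "user%d" % i} for i in range(25)]
total = import_rows(rows, fake_run, batch_size=10)
```

Each call carries one list of dictionaries, so 25 rows with a batch size of 10 result in three round trips instead of 25 separate merges.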
