How to parse big datasets using RDFLib?

Problem description

I'm trying to parse several big graphs with RDFLib 3.0. Apparently it handles the first one and dies on the second with a MemoryError... It looks like MySQL is not supported as a store anymore, so can you please suggest a way to somehow parse those?

Traceback (most recent call last):
  File "names.py", line 152, in <module>
    main()
  File "names.py", line 91, in main
    locals()[graphname].parse(filename, format="nt")
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
    location=location, file=file, data=data, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
    parser.parse(f)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
    self.line = self.readline()
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
    m = r_line.match(self.buffer)
MemoryError

Recommended answer

How many triples are in those RDF files? I have tested rdflib and it won't scale much further than a few tens of thousands of triples, if you are lucky. There is no way it really performs well for files with millions of triples.
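If you are not sure how many triples you are dealing with, a rough count is cheap to get for N-Triples, since the format puts one triple per line. A minimal sketch, with the file name as a placeholder; blank lines and # comment lines are skipped:

count = 0
with open("YOUR_FILE.ntriples") as f:
    for line in f:
        line = line.strip()
        # Skip blank lines and comments; everything else is one triple.
        if line and not line.startswith("#"):
            count += 1
print count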

The best parser out there is rapper from the Redland libraries. My first piece of advice is to not use RDF/XML and go for N-Triples instead. N-Triples is a lighter format than RDF/XML. You can transform RDF/XML to N-Triples using rapper:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples

If you like Python, you can use the Redland Python bindings:

import RDF

# Parse the N-Triples file into an in-memory Redland model
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file://file_path",
                        "http://your_base_uri.org")

# Iterate over every triple held in the model
for triple in model:
    print triple.subject, triple.predicate, triple.object
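If even a single file is too large to hold comfortably in an in-memory model, the Redland bindings can also hand you the triples as a stream while parsing, so nothing has to be kept around. A minimal sketch under that assumption, reusing the placeholder file and base URIs from the snippet above:

import RDF

# Statements are yielded one at a time as the file is parsed,
# so memory usage stays flat regardless of file size.
parser = RDF.Parser(name="ntriples")
for statement in parser.parse_as_stream("file://file_path",
                                        "http://your_base_uri.org"):
    print statement.subject, statement.predicate, statement.object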

I have parsed fairly big files (a couple of gigabytes) with the Redland libraries with no problem.

Eventually, if you are handling big datasets, you might need to assert your data into a scalable triple store; the one I normally use is 4store. 4store internally uses Redland to parse RDF files. In the long term, I think, going for a scalable triple store is what you'll have to do. And with it you'll be able to use SPARQL to query your data and SPARQL/Update to insert and delete triples.
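Once the data is in a store like 4store, queries go over the standard SPARQL HTTP protocol. A minimal sketch, assuming a 4store SPARQL endpoint is running at http://localhost:8000/sparql/ (the host, port and path depend on how you started 4s-httpd):

import urllib
import urllib2

# Hypothetical endpoint; adjust to wherever your 4s-httpd instance listens.
endpoint = "http://localhost:8000/sparql/"

query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

# Send the query using the standard SPARQL protocol (GET with a query
# parameter) and ask for JSON results.
request = urllib2.Request(endpoint + "?" + urllib.urlencode({"query": query}),
                          headers={"Accept": "application/sparql-results+json"})
print urllib2.urlopen(request).read()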

