Parse large RDF in Python

Question

I'd like to parse a very large (about 200 MB) RDF file in Python. Should I use SAX or some other library? I'd appreciate some very basic code that I can build on, for example to retrieve a tag.

Thanks in advance.

Recommended answer

If you are looking for fast performance, I'd recommend Raptor with the Redland Python bindings. Raptor is written in C and performs much better than RDFLib, and the Python bindings let you use it without having to deal with C.

Another tip for improving performance: avoid parsing RDF/XML and use another RDF serialization such as Turtle or N-Triples. Parsing N-Triples in particular is much faster than parsing RDF/XML, because the N-Triples syntax is simpler.

You can convert your RDF/XML into N-Triples using rapper, a command-line tool that comes with Raptor:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
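
If the conversion needs to run from a Python script rather than a shell, one option is to call rapper through `subprocess`. This is only a sketch: `rapper_command` and `convert_to_ntriples` are hypothetical helper names, the code assumes Python 3 with the `rapper` executable on your PATH, and it uses only the `-i` and `-o` options from the command above.

```python
import subprocess


def rapper_command(src_path):
    """Build the rapper argument list for RDF/XML -> N-Triples (hypothetical helper)."""
    return ["rapper", "-i", "rdfxml", "-o", "ntriples", src_path]


def convert_to_ntriples(src_path, dst_path):
    """Run rapper and write its standard output to dst_path.

    Assumes the rapper executable is installed and on PATH.
    """
    with open(dst_path, "wb") as out:
        # check=True raises CalledProcessError if rapper exits with an error.
        subprocess.run(rapper_command(src_path), stdout=out, check=True)
```

A call such as `convert_to_ntriples("YOUR_FILE.rdf", "YOUR_FILE.ntriples")` would then mirror the shell redirection.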

The N-Triples file will contain triples like:

<s1> <p> <o> .
<s2> <p2> "literal" .

Parsers tend to handle this structure very efficiently. It is also more memory-efficient than RDF/XML because, as you can see, the data structure is smaller.
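
To illustrate why this format is cheap to process, here is a rough pure-Python sketch that reads one triple per line, so a large file can be streamed without loading it all into memory. It is not a substitute for Raptor: `TRIPLE_RE` and `iter_triples` are hypothetical names, and the pattern is a deliberate simplification that ignores comments and parts of the real N-Triples grammar.

```python
import re

# Simplified pattern; the full N-Triples grammar has more cases than this.
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'  # object
    r'\s*\.\s*$'                # terminating dot
)


def iter_triples(lines):
    """Yield (subject, predicate, object) strings, one input line at a time."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match:  # blank lines and anything unrecognised are skipped
            yield match.groups()


# The two sample triples shown above, as in-memory lines.
sample = ['<s1> <p> <o> .', '<s2> <p2> "literal" .']
for subject, predicate, obj in iter_triples(sample):
    print(subject, predicate, obj)
```

Because `iter_triples` accepts any iterable of lines, an open file object can be passed in directly and only one line is held in memory at a time.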

The code below is a simple example using the Redland Python bindings:

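from __future__ import print_function  # same script runs on Python 2 and 3
import RDF

# Valid parser names include "ntriples", "turtle" and "rdfxml".
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
# The third argument is the base URI for resolving relative URIs.
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")

for triple in model:
    print(triple.subject, triple.predicate, triple.object)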

The base URI is the URI against which relative URIs inside your RDF document are resolved. For more details, see the API documentation for the Python Redland bindings.
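
As a rough illustration of what resolving against a base URI means, the sketch below uses the standard library's `urljoin` (Python 3). The parser does this internally, so the snippet is not part of the Redland API; the base URI and the relative references are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical base URI, standing in for the one passed to the parser.
base_uri = "http://example.org/data/"

# Relative references are resolved against the base URI.
print(urljoin(base_uri, "alice"))
print(urljoin(base_uri, "#me"))

# Absolute URIs are left unchanged.
print(urljoin(base_uri, "http://other.example/bob"))
```

If your document contains only absolute URIs, the choice of base URI has no effect on the resulting triples.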

If you don't care much about performance, use RDFLib: it is simple and easy to use.
