Jena parsing issue for Freebase RDF dump (Jan 2014)
Problem description
I am trying to parse the Freebase dump file freebase-rdf-2014-01-12-00-00.gz (25 GB) using Jena.
Jena has reported many issues with bad data.
For example, 150.0 is reported as not valid, and the values true and false are reported as not valid.
I resolved these issues by adding double quotes around the decimal and true/false values in the dump file.
However, Jena is still reporting issues. The current one is: org.apache.jena.riot.RiotException: [line: 161083, col: 110] Illegal object: [MINUS]
Is there any way to preprocess this data so that I don't have to fix each issue one by one? My Java code:
// Open TDB dataset
String directory = "D:/test_dump";
Dataset dataset = TDBFactory.createDataset(directory);
// Assume we want the default model, or we could get a named model here
Model tdb = dataset.getDefaultModel();
// Read the input file - only needs to be done once
String source = "D:/test_dump/fixed-freebase-second-rdf.gz";
FileManager.get().readModel( tdb, source, "N-TRIPLES" );
Recommended answer
The data is in Turtle format, not N-Triples. They use various Turtle abbreviations (like true for "true"^^xsd:boolean or the number -27 for "-27"^^xsd:integer).
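To make the abbreviations concrete, here is a minimal sketch in plain Java (no Jena involved) that maps a bare Turtle token to the typed literal it stands for. The class name TurtleShorthand and the method expand are hypothetical names chosen for illustration. The patterns follow the INTEGER, DECIMAL and DOUBLE productions of the Turtle grammar as I understand them. Treat it as an illustration of the shorthand rules, not as a parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Jena.
class TurtleShorthand {
    // Numeric shorthand forms from the Turtle grammar.
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?[0-9]*\\.[0-9]+");
    private static final Pattern DOUBLE = Pattern.compile(
            "[+-]?([0-9]+\\.[0-9]*[eE][+-]?[0-9]+"
            + "|\\.[0-9]+[eE][+-]?[0-9]+"
            + "|[0-9]+[eE][+-]?[0-9]+)");

    /** Returns the explicit typed literal for a shorthand token, or the token unchanged. */
    static String expand(String token) {
        if (token.equals("true") || token.equals("false")) return typed(token, "boolean");
        if (INTEGER.matcher(token).matches()) return typed(token, "integer");
        if (DECIMAL.matcher(token).matches()) return typed(token, "decimal");
        if (DOUBLE.matcher(token).matches()) return typed(token, "double");
        return token; // IRIs, prefixed names, quoted literals, etc. are left alone
    }

    private static String typed(String lexical, String xsdType) {
        return "\"" + lexical + "\"^^xsd:" + xsdType;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"true", "-27", "150.0", "\"150.0\""}) {
            System.out.println(t + "  =>  " + expand(t));
        }
    }
}
```

Note that an already quoted token such as "150.0" is returned unchanged: it is a string literal, not a shorthand for a decimal.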
There may still be errors, as their dumps have also contained illegal syntax, e.g. use of $ in prefixed names without the necessary \ escape.
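If the unescaped $ turns out to be the remaining problem, one possible preprocessing step is to escape it in prefixed names only. The sketch below is plain Java under loudly stated assumptions: each dump line is tab-separated (subject, predicate, object, final dot), literals contain no raw tab characters, and any field that contains a colon and does not start with <, " or _: is a prefixed name. The names DollarEscaper, escapeToken and escapeLine are hypothetical. I have not verified it against the actual dump.

```java
// Hypothetical preprocessing helper; the tab-separated layout is an assumption.
class DollarEscaper {
    /** Escapes unescaped '$' in a prefixed name; IRIs, literals and blank nodes pass through. */
    static String escapeToken(String token) {
        boolean prefixedName = !token.isEmpty()
                && token.charAt(0) != '<'
                && token.charAt(0) != '"'
                && !token.startsWith("_:")
                && token.indexOf(':') >= 0;
        if (!prefixedName) return token;

        StringBuilder out = new StringBuilder(token.length() + 4);
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Add a backslash only when the '$' is not already escaped.
            if (c == '$' && (i == 0 || token.charAt(i - 1) != '\\')) out.append('\\');
            out.append(c);
        }
        return out.toString();
    }

    /** Applies escapeToken to every tab-separated field of one dump line. */
    static String escapeLine(String line) {
        String[] fields = line.split("\t", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = escapeToken(fields[i]);
        }
        return String.join("\t", fields);
    }
}
```

To run this over the whole 25 GB file, the lines could be streamed through java.util.zip.GZIPInputStream and written back out through GZIPOutputStream, so that the dump is never held in memory.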
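Adding quotes around values changes the RDF: the quoted "150.0" is a string literal, whereas the unquoted 150.0 is shorthand for "150.0"^^xsd:decimal.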