Wikidata on local Blazegraph: Expected an RDF value here, found '' [line 1]


Problem description

We (Thomas and Wolfgang) have installed Wikidata and Blazegraph locally, following the instructions here: https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md

The mvn package command was successful:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] parent ............................................. SUCCESS [ 54.103 s]
[INFO] Shared code ........................................ SUCCESS [ 23.085 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 11.698 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [02:12 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [01:02 min]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [02:19 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 25.466 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS

We are both using:

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

We both downloaded the latest-all.ttl.gz, e.g.

31064651574 Jan  3 19:30 latest-all.ttl.gz

from https://dumps.wikimedia.org/wikidatawiki/entities/, which took some 4 hours.

The munge step created 424 files named "wikidump-000000001.ttl.gz" and so on in data/split:

~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$ ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de 
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
08:28:00.770 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 60000 entities at (244, 142, 75)
08:28:32.670 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 70000 entities at (272, 161, 84)
08:29:12.529 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 80000 entities at (261, 172, 91)
08:29:47.764 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 90000 entities at (272, 184, 98)
08:30:20.254 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 100000 entities at (286, 196, 105)
08:30:20.256 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000002.ttl.gz
08:30:55.058 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 110000 entities at (286, 206, 112)

When Thomas tried to load one file into Blazegraph with

./loadRestAPI.sh -n wdq -d data/split/wikidump-000000001.ttl.gz

he got the error below. Trying to import from the UPDATE tab of Blazegraph also didn't work.

How can this be fixed?


ERROR: uri=[file:/home/tsc/projects/TestSPARQL/wikidata-query-rdf-0.2.1/dist/target/service-0.2.1/data/split/wikidump-000000001.ttl.gz], context-uri=[]
java.util.concurrent.ExecutionException: org.openrdf.rio.RDFParseException: Expected an RDF value here, found '' [line 1]
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:281)
    at com.bigdata.rdf.sail.webapp.InsertServlet.doPostWithURIs(InsertServlet.java:397)
    at com.bigdata.rdf.sail.webapp.InsertServlet.doPost(InsertServlet.java:116)
    at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:303)
    at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:192)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:808)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.openrdf.rio.RDFParseException: Expected an RDF value here, found '' [line 1]
    at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:441)
    at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:671)
    at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1306)
    at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:637)
    at org.openrdf.rio.turtle.TurtleParser.parseSubject(TurtleParser.java:449)
    at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:383)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:216)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:159)
    at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithURLsTask.call(InsertServlet.java:556)
    at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithURLsTask.call(InsertServlet.java:414)
    at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Answer

The loadRestAPI.sh script is basically the one mentioned in:

https://wiki.blazegraph.com/wiki/index.php/Bulk_Data_Load#Command_line

so it should be possible to use the command line tool directly instead of the REST API.
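As a rough, unverified sketch of what such a direct call could look like (the DataLoader class comes from the Bulk_Data_Load page linked above; the exact jar name on the classpath and the flag set are assumptions mirroring the properties printed by loadRestAPI.sh):

# sketch only: classpath/jar name is an assumption, flags follow the Bulk_Data_Load page
java -cp blazegraph-service-0.3.0-SNAPSHOT.jar \
     com.bigdata.rdf.store.DataLoader \
     -namespace wdq \
     RWStore.properties data/split/wikidump-000000001.ttl.gz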

Also the whole process seems to be quite awkward. The tool relies on the .gz file, which is 25% bigger than the .bz2 file and takes longer to download. Unzipping the .bz2 file is quicker than the munge process. My assumption is that processing the unzipped 230 GB file, e.g.

230033083334 Jan 4 07:29 wikidata-20180101-all-BETA.ttl

in "chunk-wise" fashion might work better. But first we need to see what makes the import choke.

My first issue was that the shell script runBlazegraph.sh gave an error for the missing mwservices.json.

I assume a file like https://github.com/wikimedia/wikidata-query-deploy/blob/master/mwservices.json is expected.

So I tried to fix it with

wget https://raw.githubusercontent.com/wikimedia/wikidata-query-deploy/master/mwservices.json

although I doubt this is of much relevance.

The actual call

./loadRestAPI.sh -n wdq -d data/split/wikidump-000000001.ttl.gz 
Loading with properties...
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=wdq
#Files to load
fileOrDirs=data/split/wikidump-000000001.ttl.gz
#Property file (if creating a new namespace)
propertyFile=/home/wf/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT/RWStore.properties
<?xml version="1.0"?><data modified="0" milliseconds="493832"/>DATALOADER-SERVLET: Loaded wdq with properties: /home/wf/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT/RWStore.properties

worked for me on an Ubuntu 16.04 LTS server with Java 1.8.0_151, so I believe we have to look into more details to fix Thomas' problem.

See also https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/Documentation for more details.

To check the results I used an ssh tunnel to my Ubuntu server

ssh -L 9999:localhost:9999 user@server

and then opened

http://localhost:9999/bigdata/namespace/wdq/sparql

in the browser on my local machine (laptop).
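As an additional sanity check (not part of the original setup), the endpoint behind the tunnel can also be queried directly over the standard SPARQL protocol:

# count the statements loaded so far through the tunnelled endpoint
curl -G http://localhost:9999/bigdata/namespace/wdq/sparql \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'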

The second import also worked.

Then I checked the database content with the following SPARQL query:

SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP by ?type
ORDER by desc(?typecount)
LIMIT 7

which gives the result:

type                                              typecount
<http://wikiba.se/ontology#BestRank>                2938060
schema:Article                                      2419109
<http://wikiba.se/ontology#QuantityValue>             78105
<http://wikiba.se/ontology#TimeValue>                 61553
<http://wikiba.se/ontology#GlobecoordinateValue>      57032
<http://wikiba.se/ontology#GeoAutoPrecision>           3462
<http://www.wikidata.org/prop/novalue/P17>               531

Given the import experience, I would say that the munge and loadRestAPI calls can be run somewhat in parallel, since the loadRestAPI step is apparently slower.
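One way to do that, sketched here with our own (unverified) error handling around the same loadRestAPI.sh call as above, is to simply loop over the chunks the munge step has already written:

# load every munged chunk in sequence; stop at the first failure
for f in data/split/wikidump-*.ttl.gz; do
    echo "loading $f"
    ./loadRestAPI.sh -n wdq -d "$f" || { echo "failed on $f" >&2; break; }
done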

It takes some 5 minutes per gz file to import. This rate drops later on, and some files actually took up to 1 hour 15 minutes on Wolfgang's server.

Loading all the data will probably take 10 days or more on Wolfgang's first machine, so please stay tuned for the final result.

Currently 358 of 440 files have been imported after 158 hours on this machine. At this point the wikidata.jnl file is 250 GB in size and some 1,700 million statements have been imported.

The loading statistics are quite awkward. Loading one of the *.ttl.gz files takes anything from 87 to 11496 secs on Wolfgang's machine. The average is 944 secs at this time. It looks like at certain steps during the import the time per gz file goes way up, e.g. from 805 to 4943 secs or from 4823 to 11496 secs; after that the timing seems to settle at a higher level before going back to as little as 293 or 511 secs. This timing behavior makes it very difficult to predict how long the full import will take.
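Statistics like the MIN/MAX/AVG/TOT rows below can be reproduced from a plain list of per-file load times; a minimal sketch, assuming the seconds per file were collected into a hypothetical load-times.txt (one value per line):

# print MIN/MAX/AVG/TOT over one load time (in seconds) per line
awk 'NR==1 { min=$1; max=$1 }
     { sum+=$1; if ($1<min) min=$1; if ($1>max) max=$1 }
     END { printf "MIN %d  MAX %d  AVG %d  TOT %d secs\n", min, max, sum/NR, sum }' load-times.txt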

Given that loading took so long, Wolfgang configured a second import machine slightly differently:

  1. Machine: 8 cores, 56 GB RAM, a 6 TB 5,400 rpm hard disk
  2. Machine: 8 cores, 32 GB RAM, one 512 GB 7,200 rpm hard disk and one 480 GB SSD

The second machine has the data to be imported on the 7,200 rpm hard disk and the Blazegraph journal file on the SSD.

The second machine's import shows better timing behavior; after 3.8 days the import had finished, with the following statistics:

    |  sum d |   sum h |         mins |         secs |
----+--------+---------+--------------+--------------+
MIN |  0.0 d |   0.0 h |     1.2 mins |      74 secs |      
MAX |  0.0 d |   1.1 h |    64.4 mins |    3863 secs |
AVG |  0.0 d |   0.2 h |    12.3 mins |     738 secs | 
TOT |  3.8 d |  90.2 h |  5414.6 mins |  324878 secs |

The first machine is still not finished after 10 days:

SUM | 10.5 d | 252.6 h | 15154.7 mins |  909281 secs |
----+--------+---------+--------------+--------------+
MIN |  0.0 d |   0.0 h |     1.5 mins |      87 secs |
MAX |  0.3 d |   7.3 h |   440.5 mins |   26428 secs |
AVG |  0.0 d |   0.6 h |    36.4 mins |    2185 secs |
TOT | 11.1 d | 267.1 h | 16029.0 mins |  961739 secs |
----+--------+---------+--------------+--------------+
ETA |  0.6 d |  14.6 h |   874.3 mins |   52458 secs |
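The ETA row appears to be simply the projected total (TOT, roughly the average per file times all 440 files) minus the time already spent (SUM); a quick check of that arithmetic with the values from the table:

# projected total minus time already spent, values taken from the table above
echo $(( 961739 - 909281 ))    # 52458 secs, i.e. about 14.6 h or 0.6 d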

