How to read Freebase RDF data? It seems to be a bit broken

Problem description

I'm following the instructions for parsing Freebase RDF with Ruby at https://developers.google.com/freebase/v1/rdf-overview

Environment: rdf-1.1.6, rdf-turtle-1.1.4, ruby-2.1.4 [x86_64], Ubuntu 14.10

My code is:

require 'rubygems'
require 'cgi'
require 'addressable/uri'
require 'rdf'
require 'rdf/turtle'

topic_id = '/m/0d6lp'
url = Addressable::URI.parse('https://www.googleapis.com/freebase/v1/rdf' + topic_id)

RDF::Turtle::Reader.open(url) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end

I get these errors:

ERROR [line: 131] With input '"Cidade e Condado de S\xe3o Francisco"@pt;
    ns:common.topic.alias    "City and County of San Franc': Invalid token "\"Cidade" (found "\"Cidade"), production = :_predicateObjectList_5
ERROR [line: 131] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"City and County of San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 132] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"SF\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 133] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Frisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 134] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"The City by the Bay\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 135] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Kalifornija\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 136] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 137] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Calif.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 138] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"City by the Bay - San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 139] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"La Ciutat i el Comtat de San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 140] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Yerba Buena\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 141] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"La ciutat i comtat de San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 142] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Franciskas\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 143] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Kalifornija\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 144] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"旧金山\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 145] Expected one of [:IRIREF, :BLANK_NODE_LABEL, :ANON, "(", "[", :PNAME_LN, :PNAME_NS, :INTEGER, :DECIMAL, :DOUBLE, "true", "false", :STRING_LITERAL_QUOTE, :STRING_LITERAL_SINGLE_QUOTE, :STRING_LITERAL_LONG_SINGLE_QUOTE, :STRING_LITERAL_LONG_QUOTE] (found ";"), production = :objectList
ERROR [line: 146] Expected one of [:IRIREF, :BLANK_NODE_LABEL, :ANON, "(", "[", :PNAME_LN, :PNAME_NS, :INTEGER, :DECIMAL, :DOUBLE, "true", "false", :STRING_LITERAL_QUOTE, :STRING_LITERAL_SINGLE_QUOTE, :STRING_LITERAL_LONG_SINGLE_QUOTE, :STRING_LITERAL_LONG_QUOTE] (found ";"), production = :objectList
ERROR [line: 147] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Bandar raya dan Daerah San Francisco merupakan bandar raya keempat paling ramai penduduk di California dan keempat belas di Amerika Syarikat, dengan anggaran penduduk seramai 744,041 pada 2006. Ia terletak di hujung Semenanjung San Francisco dan merupakan titik fokus kewangan, kebudayaan serta pengangkutan kawasan metropolitan San Francisco Bay Area. San Francisco merupakan bandar utama kedua paling padat di Amerika Syarikat.\nPada 1776, orang-orang Sepanyol menduduki hujung semenanjung San Francisco, dan mendirikan sebuah kubu dan misi di Golden Gate. Kerubut Emas California pada 1848 mendorong bandar ini berkembang dengan pesat. Selepas dimusnahkan dalam Gempa bumi San Francisco 1906, San Francisco telah dibina semula dengan cepat.\nSemasa tahun 1960-an, kawasan Haight-Ashbury di San Francisco menjadi terkenal apabila menjadi pusat budaya hippie apabila ribuan golongan muda dan seniman bermigrasi ke lokasi tersebut. Walaupun Haight-Ashbury telah mengalami gentrifikasi dan hilang identiti budaya hippie dalam dekad-dekad berikutnya, San Francisco telah menjadi sinonim dengan budaya dan nostalgia hippie.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 148] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"ซานฟรานซิสโก หรือ แซนแฟรนซิสโก คือเมืองในรัฐแคลิฟอร์เนีย สหรัฐอเมริกา มีประชากร ประมาณ 808,976 คน ซึ่งเป็นเมืองที่มีความหนาแน่นประชากรเป็นอันดับสองของประเทศ เมืองซานฟรานซิสโกตั้งอยู่บริเวณอ่าวซานฟรานซิสโก\nชาวยุโรปกลุ่มแรกที่มาตั้งรกรากในซานฟรานซิสโกคือชาวสเปน โดยในปี ค.ศ. 1776 เมืองมีชื่อว่า เซนต์ฟรานซิส ในภายหลังจากช่วงยุคตื่นทองในปี ค.ศ. 1848 ทำให้ประชากรในซานฟรานซิสโกเพิ่มขึ้นอย่างรวดเร็ว และเมืองเติบโตอย่างมาก ถึงแม้ว่าซานฟรานซิสโกจะประสบปัญหา แผ่นดินไหวและไฟไหม้ขนาดใหญ่ในช่วงปี ค.ศ. 1906 ซานฟรานซิสโกกลับฟื้นตัวได้อย่างรวดเร็ว และได้ชื่อว่าเป็นเมืองสำคัญเมืองหนึ่งในแถบชายฝั่งตะวันตกของประเทศ\nซานฟรานซิสโกมีลักษณะภูมิประเทศที่เป็นเขา และมีชายฝั่งติดกับมหาสมุทรแปซิฟิก สัญลักษณ์ที่ขึ้นชื่อของเมืองซานฟรานซิสโกได้แก่ สะพานโกลเดนเกต และแหล่งท่องเที่ยวที่มีชื่อเสียงได้แก่ เกาะอัลคาทราซ รถรางซานฟรานซิสโก Pier 39 และ ถนนลอมบาร์ด ทีมกีฬา อเมริกันฟุตบอล ที่สำคัญได้แก่ ซานฟรานซิสโก 49ers เป็นเมืองเศรษฐกิจที่มีขนาดใหญ่ และชาวเอเชียอาศัยที่อ่าวซานฟรานซิโกเป็นจำนวนมาก\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 149] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco je četrto največje mesto v Kaliforniji; hkrati je tudi okrožje. Ocena prebivalcev iz leta 2004 je 744.230.\nSamo mesto leži na skrajnem delu polotoka San Francisco, hkrati pa zajema več otokov v zalivu San Francisca in ožini Golden Gate.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 150] With input '"\uc0cc\ud504\ub780\uc2dc\uc2a4\ucf54\ub294 \ubbf8\uad6d \uce98\ub9ac\ud3ec\ub2c8\uc544 \uc8fc \uc911': Invalid token "\"\\uc0cc\\ud504\\ub780\\uc2dc\\uc2a4\\ucf54\\ub294" (found "\"\\uc0cc\\ud504\\ub780\\uc2dc\\uc2a4\\ucf54\\ub294"), production = :_triples_1
ERROR [line: 150] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Сан-Франциско — місто на західному узбережжі США у штаті Каліфорнія, порт, осередок індустрії й торгівлі, осередок багатьох дослідницьких інститутів, зокрема Каліфорнійського університету та Університету штату Каліфорнія, населення 805 000 мешканців, близько 3 000 українців.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 151] With input '"\u820a\u91d1\u5c71\uff0c\u6b63\u5f0f\u540d\u7a31\u70ba\u820a\u91d1\u5c71\u5e02\u90e1\uff0c\u662f\u7f': Invalid token "\"\\u820a\\u91d1\\u5c71\\uff0c\\u6b63\\u5f0f\\u540d\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u5e02\\u90e1\\uff0c\\u662f\\u7f" (found "\"\\u820a\\u91d1\\u5c71\\uff0c\\u6b63\\u5f0f\\u540d\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u5e02\\u90e1\\uff0c\\u662f\\u7f8e\\u570b\\u52a0\\u5229\\u798f\\u5c3c\\u4e9e\\u5dde\\u5317\\u90e8\\u7684\\u4e00\\u5ea7\\u90fd\\u5e02\\uff0c\\u4e5f\\u662f\\u52a0\\u5dde\\u552f\\u4e00\\u5e02\\u90e1\\u5408\\u4e00\\u7684\\u884c\\u653f\\u5340\\uff0c\\u4e2d\\u6587\\u53c8\\u97f3\\u8b6f\\u70ba\\u4e09\\u85e9\\u5e02\\u548c\\u8056\\xb7\\u5f17\\u6717\\u897f\\u65af\\u79d1\\uff0c\\u4ea6\\u5225\\u540d\\u300c\\u91d1\\u9580\\u57ce\\u5e02\\u300d\\u3001\\u300c\\u7063\\u908a\\u4e4b\\u57ce\\u300d\\u3001\\u300c\\u9727\\u57ce\\u300d\\u7b49\\u3002\\u4f4d\\u65bc\\u820a\\u91d1\\u5c71\\u534a\\u5cf6\\u7684\\u5317\\u7aef\\uff0c\\u6771\\u81e8\\u820a\\u91d1\\u5c71\\u7063\\u3001\\u897f\\u81e8\\u592a\\u5e73\\u6d0b\\uff0c\\u4eba\\u53e3\\u7d0483\\u842c\\uff0c\\u70ba\\u52a0\\u5dde\\u7b2c\\u56db\\u5927\\u57ce\\u3002\\u5176\\u8207\\u5357\\u908a\\u7684\\u8056\\u99ac\\u5201\\u90e1\\u3001\\u5357\\u7063\\u7684\\u8056\\u8377\\u897f\\u8207\\u77fd\\u8c37\\u5730\\u5340\\u3001\\u6771\\u7063\\u7684\\u5967\\u514b\\u862d\\u8207\\u67cf\\u514b\\u840a\\u3001\\u4ee5\\u53ca\\u5317\\u908a\\u7684\\u99ac\\u6797\\u90e1\\u8207\\u7d0d\\u5e15\\u90e1\\u5408\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u7063\\u5340\\u3002\\n\\u820a\\u91d1\\u5c71\\u662f\\u5317\\u52a0\\u5dde\\u8207\\u820a\\u91d1\\u5c71\\u7063\\u5340\\u7684\\u5546\\u696d\\u8207\\u6587\\u5316\\u767c\\u5c55\\u4e2d\\u5fc3\\uff0c\\u7576\\u5730\\u4f4f\\u6709\\u5f88\\u591a\\u85dd\\u8853\\u5bb6\\u3001\\u4f5c\\u5bb6\\u548c\\u6f14\\u54e1\\uff0c\\u572820\\u4e16\\u7d00\\u53ca21\\u4e16\\u7d00\\u521d\\u4e00\\u76f4\\u662f\\u7f8e\\u570b\\u563b\\u76ae\\u6587\\u5316\\u548c\\u8fd1\\u4ee3\\u81ea\\u7531\\u4e3b\\u7fa9\\u3001\\u9032\\u6b
65\\u4e3b\\u7fa9\\u7684\\u4e2d\\u5fc3\\u4e4b\\u4e00\\u3002\\u9019\\u500b\\u57ce\\u5e02\\u540c\\u6a23\\u4ee5\\u5176\\u773e\\u591a\\u7684\\u7db2\\u969b\\u7db2\\u8def\\u516c\\u53f8\\u800c\\u805e\\u540d\\uff0c\\u540c\\u6642\\u4e5f\\u6210\\u70ba\\u4e86\\u5ee3\\u5927\\u540c\\u6027\\u6200\\u8005\\u7684\\u805a\\u5c45\\u5730\\u3002\\u820a\\u91d1\\u5c71\\u4e5f\\u662f\\u53d7\\u6b61\\u8fce\\u7684\\u65c5\\u904a\\u76ee\\u7684\\u5730\\uff0c\\u8207\\u5176\\u6dbc\\u723d\\u7684\\u590f\\u5b63\\u3001\\u591a\\u9727\\u3001\\u7dbf\\u5ef6\\u7684\\u4e18\\u9675\\u5730\\u5f62\\u3001\\u6df7\\u5408\\u7684\\u5efa\\u7bc9\\u98a8\\u683c\\uff0c\\u548c\\u91d1\\u9580\\u5927\\u6a4b\\u3001\\u7e9c\\u8eca\\u3001\\u60e1\\u9b54\\u5cf6\\u76e3\\u7344\\u53ca\\u4e2d\\u570b\\u57ce\\u7b49\\u666f\\u9ede\\u805e\\u540d\\u3002\\u6b64\\u5916\\uff0c\\u820a\\u91d1\\u5c71\\u4e5f\\u662f\\u4e94\\u5927\\u4e3b\\u8981\\u9280\\u884c\\u548c\\u8a31\\u591a\\u5927\\u578b\\u516c\\u53f8\\u7684\\u7e3d\\u90e8\\u6240\\u5728\\uff0c\\u5305\\u62ec\\u84cb\\u749e\\u3001\\u592a\\u5e73\\u6d0b\\u74e6\\u96fb\\u516c\\u53f8\\u3001Yelp\\u3001Pinterest\\u3001Twitter\\u3001\\u512a\\u6b65\\u3001Mozilla\\u548cCraigslist\\u7b49\\u3002\"@zh-TW;"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco er en amerikansk by i delstaten Californien. Byen er med sine 837.442 indbyggere Calif': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco is een stad in de Amerikaanse staat Californi\xeb en het hart van de San Francisco Bay': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] undefined prefix "county"
ERROR [line: 151] With input 'officieel heet ze City and County of San Francisco.\nDe stad, die 805.235 inwoners telt, ligt op het ': Invalid token "officieel" (found "officieel"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco, ofici\xe1ln\u011b: M\u011bsto a Okres San Francisco, je velk\xe9 m\u011bsto na z\xe1p': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] With input 'nejlidnat\u011bj\u0161\xedm m\u011bstem st\xe1tu Kalifornie a 14. nejlidnat\u011bj\u0161\xedm m\u011b': Invalid token "nejlidnat\\u011bj\\u0161\\xedm" (found "nejlidnat\\u011bj\\u0161\\xedm"), production = :_turtleDoc_1

I also tried Python and C# RDF Turtle libraries, and all of them complain about \x. I tried to fix it manually by replacing \x with \u00 in the strings, but then the parsers start complaining about unescaped double quotes in long string literals.

The errors above are what I got when using the official Google code example.

Is Freebase RDF broken? Am I doing something wrong? What is the right way to handle Freebase RDF?

Thanks.

Recommended answer

As discussed in Jena parsing issue for freebase RDF dump (Jan 2014), the Freebase dumps aren't always legal Turtle/N3. In this case, you're grabbing the data from the URL built in your code (https://www.googleapis.com/freebase/v1/rdf followed by the topic ID /m/0d6lp).

When I try to parse that with Jena, I get this error:

09:24:01 ERROR riot                 :: [line: 131, col: 54] illegal escape sequence value: x (0x78)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 131, col: 54] illegal escape sequence value: x (0x78)

As you noted, the immediate issue is that Turtle strings shouldn't contain \x escapes. Turtle supports a few different kinds of escapes (see § 6.4 Escape Sequences), and it looks like these ought to be of the form \uXXXX (or \UXXXXXXXX, with a capital U for the eight-digit form). Line 131 is:

    ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;

We can fix it by replacing the \x with \u00, so that \xe3 becomes \u00e3. Sure enough, the corrected line parses when placed in a file of its own (the input, followed by the RDF/XML that Jena writes for it):

[] <ns:common.topic.alias> "Cidade e Condado de S\u00e3o Francisco"@pt .

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="ns:">
  <rdf:Description>
    <j.0:common.topic.alias xml:lang="pt">Cidade e Condado de São Francisco</j.0:common.topic.alias>
  </rdf:Description>
</rdf:RDF>
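The single-line replacement above can be sketched in Ruby, the language used in the question. This is only an illustration: fix_hex_escapes is a hypothetical helper, not part of the rdf or rdf-turtle gems, and it assumes every \x is followed by exactly two hex digits and is not itself preceded by an escaping backslash.

```ruby
# Hypothetical helper (not part of any RDF library): rewrite each \xHH escape,
# which Turtle does not define, as the equivalent \u00HH escape.
def fix_hex_escapes(text)
  text.gsub(/\\x(\h{2})/) { "\\u00#{Regexp.last_match(1)}" }
end

# Line 131 of the Freebase output, as quoted above. Single quotes keep the
# backslash literal in Ruby, so the string really contains "\xe3".
line = 'ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;'
puts fix_hex_escapes(line)
```

The block form of gsub is used so that the backslash in the replacement is not interpreted as a back-reference.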

You can try globally replacing \x with \u00, but that won't fix all the problems in the file. After doing that, I end up with

09:34:27 ERROR riot                 :: [line: 170, col: 653] Unknown char: \(92;0x005C)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 170, col: 653] Unknown char: \(92;0x005C)

That's not a particularly helpful error, but I think what's going on is that line 170 looks like this (I've abbreviated the runs of legal \uXXXX escapes with ellipses):

ns:common.topic.description     "…"\u05e8. …"@iw;

I'd guess that the second quotation mark should be escaped, but since it isn't, it's seen as the end of the string. That means the next character read is the \ from \u05e8, and \ doesn't make sense in that position (a comma, semicolon, at-sign, caret, or dot would).
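One rough way to find lines affected by the unescaped quotation mark described above is to count the double quotes that are not preceded by a backslash. The sketch below assumes, as the Freebase output seems to suggest, that each line holds at most one short string literal, so more than two such quotes is suspicious. The helper names are hypothetical, the sample line abbreviates the description text with "...", and a quote preceded by an escaped backslash (\\") would be miscounted.

```ruby
# Hypothetical check: count double quotes that are not preceded by a backslash.
# A line holding one short string literal should contain exactly two of them.
def unescaped_quote_count(line)
  line.scan(/(?<!\\)"/).size
end

def suspicious?(line)
  unescaped_quote_count(line) > 2
end

# Shape of line 170 (text abbreviated with "...") and a well-formed line.
bad  = 'ns:common.topic.description     "..."\u05e8. ..."@iw;'
good = 'ns:common.topic.alias    "Cidade e Condado de S\u00e3o Francisco"@pt;'

puts suspicious?(bad)
puts suspicious?(good)
```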

I finally got a version that I could parse after performing a few transformations, but these are obviously a bit ad hoc.

  1. Replace all \x with \u00.
  2. Since it appears that there's just one string per line, replace the first " on a line with """, and replace the last " on a line with """. This means that " ... " ... " gets turned into """ ... " ... """, which is legal.
  3. There are dollar signs in a bunch of the names, and offhand I think that's illegal. I replaced them with DOLLAR. This isn't good, though, because in some places $, like \x, should be replaced by \u. E.g.:

key:wikipedia.ca    "San_Francisco_$0028Calif$00F2rnia$0029";
ns:type.object.key    ns:wikipedia.fr.San_Francisco_$0028Californie$0029;
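To illustrate why replacing $ with DOLLAR loses information, here is a small Ruby sketch that decodes the sequences instead. It assumes that each $ followed by four hex digits stands for one Unicode code point, which is what the examples above suggest. decode_dollar_escapes is a hypothetical helper, and the decoded text would still need whatever escaping Turtle requires before it could be used in a prefixed name.

```ruby
# Assumption: "$" followed by four hex digits encodes one Unicode code point.
# Hypothetical helper that turns each such sequence into the character itself.
def decode_dollar_escapes(key)
  key.gsub(/\$(\h{4})/) { [Regexp.last_match(1).hex].pack('U') }
end

# The key from the first example line above.
puts decode_dollar_escapes('San_Francisco_$0028Calif$00F2rnia$0029')
```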

So, the result isn't good, but it can be parsed. I generated it with sed:

sed -r -e 's/\\x/\\u00/g ; s/^([^"]*)"/\1"""/ ; s/"([^"]*)$/"""\1/ ; s/[$]/DOLLAR/g' 0d6lp.ttl
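For readers who would rather stay in Ruby, a rough counterpart of the sed command is sketched below. It applies the same four substitutions in the same order and carries the same caveats: it assumes one string literal per line and it still mangles the dollar-sign sequences. patch_line is a hypothetical name.

```ruby
# Rough Ruby counterpart of the sed command above; patch_line is a
# hypothetical helper and shares the ad hoc nature of the original.
def patch_line(line)
  # \xHH -> \u00HH (block form, so the backslash is taken literally)
  line = line.gsub('\x') { '\u00' }
  # first " on the line -> """
  line = line.sub(/\A([^"]*)"/) { "#{Regexp.last_match(1)}\"\"\"" }
  # last " on the line -> """
  line = line.sub(/"([^"]*)\z/) { "\"\"\"#{Regexp.last_match(1)}" }
  # every $ -> DOLLAR (lossy, as noted in step 3)
  line.gsub('$') { 'DOLLAR' }
end

samples = [
  '    ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;',
  'key:wikipedia.ca    "San_Francisco_$0028Calif$00F2rnia$0029";'
]
samples.each { |l| puts patch_line(l) }
```

To process a whole file, each line can be passed through patch_line after removing its trailing newline, for example with File.foreach and chomp.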
