Querying Open Data Communities Data with SPARQL


Question

I'm trying to get some information from the Lower Layer Super Output Areas (LSOAs) and UK Postcodes datasets.

I need the postcode and LSOA information as a data dump for use in Excel.

Notation and Label of type 'Lower Layer Super Output Area', e.g. http://opendatacommunities.org/doc/geography/lsoa/E01009437

The 'lsoa' for each resource of type 'Postcode Unit', e.g. http://opendatacommunities.org/resource?uri=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fid%2Fpostcodeunit%2FB721NB

I have no idea how to use the SPARQL engine on the site to get this information, or how to extract the information from the N-Triples file I downloaded…

Recommended answer

There are two main options for retrieving the data you want. In some cases, it is possible to query the data using a publicly available SPARQL endpoint. This is probably the most convenient approach, and the one to take unless there's some definite reason that you need the data locally. There are limitations to this approach, however, and in those cases it makes sense to download the dataset and query it locally. I'll describe the remote endpoint solution first, and then the solution using local queries. The limitations of the SPARQL endpoint (e.g., hard timeouts) mean that the first approach isn't sufficient for this particular task, so the specific answer to this question is the second approach.

I wasn't familiar with these particular datasets and ontologies before this question, so the first approach also walks through the "getting familiar with the data" process.

There is an Open Data Communities SPARQL endpoint against which you can run queries and get some data out. I haven't looked at this data before, so rather than just posting the final answer, I'll walk through the process that I used to figure out what sort of query to run.

One of the pages you linked to, B72 1NB, mentions that the resource has type PostcodeUnit, which has the URI

http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit

Based on this, the first thing I tried was a SPARQL query to retrieve some postcode units, so I ran the following query against the endpoint above. (If you copy and paste it in there, you may need to remove any leading whitespace before SELECT. I had to do that, anyhow.)

SELECT * WHERE { 
  ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit>
}
LIMIT 10

(The LIMIT helps ensure that the results come back in a timely manner, and that we're not asking the server to do too much.) This produces results like

--------------------------------------------------------------
| postcodeUnit                                               |
==============================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY256SA> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY209DR> |
--------------------------------------------------------------

The B72 1NB page shows its lsoa as Birmingham 006C. The IRI for the lsoa property (you can also see this in the data you downloaded) is

http://opendatacommunities.org/def/geography#lsoa

so we extend the SPARQL query to

SELECT * WHERE { 
  ?postcodeUnit
    a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> ;
    <http://opendatacommunities.org/def/geography#lsoa> ?lsoa .
}
LIMIT 10

The results look like this:

-----------------------------------------------------------------------------------------------------------------------------
| postcodeUnit                                               | lsoa                                                         |
=============================================================================================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> | <http://opendatacommunities.org/id/geography/lsoa/E01029309> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> | <http://opendatacommunities.org/id/geography/lsoa/E01029706> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> | <http://opendatacommunities.org/id/geography/lsoa/E01018373> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> | <http://opendatacommunities.org/id/geography/lsoa/E01014172> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> | <http://opendatacommunities.org/id/geography/lsoa/E01018514> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> | <http://opendatacommunities.org/id/geography/lsoa/E01029175> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  | <http://opendatacommunities.org/id/geography/lsoa/E01014204> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> | <http://opendatacommunities.org/id/geography/lsoa/E01029225> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SW65TP>  | <http://opendatacommunities.org/id/geography/lsoa/E01001950> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF15AX>  | <http://opendatacommunities.org/id/geography/lsoa/E01014155> |
-----------------------------------------------------------------------------------------------------------------------------

You can use prefixes in your query if you want to make it a bit more readable and concise:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit ;
    geo:lsoa ?lsoa .
}
LIMIT 10

The results will be the same, of course. At the bottom of each of those results pages, you can download the results in a number of other formats. One of the formats is CSV, which you may be able to import directly into a spreadsheet (you said you wanted to use the data in Excel).
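Since the CSV columns contain full IRIs rather than bare codes, a little post-processing can make the file easier to work with in Excel. The following is a minimal Python sketch, not part of the original workflow. It assumes the CSV has a header row followed by two IRI columns, as in the results above, and the function name shorten_rows is hypothetical.

```python
import csv
import io


def shorten_rows(csv_text):
    """Yield (postcode, lsoa_code) pairs from SPARQL CSV results.

    Assumption: a header row, then two columns of full IRIs
    (postcode unit, LSOA), as in the query results shown above.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        unit_iri, lsoa_iri = row
        # Keep only the part after the last "/" of each IRI.
        yield unit_iri.rsplit("/", 1)[-1], lsoa_iri.rsplit("/", 1)[-1]


sample = (
    "postcodeUnit,lsoa\n"
    "http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB,"
    "http://opendatacommunities.org/id/geography/lsoa/E01029309\n"
)

for postcode, lsoa_code in shorten_rows(sample):
    print(postcode, lsoa_code)  # prints: TA219HB E01029309
```

The shortened pairs can then be written back out with csv.writer to produce a two-column file of codes.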

Discussion in the comments pointed out that the sheer number of PostcodeUnits makes the result set very large. The UK Postcodes dataset contains four types of resources, in order of increasing geographic size: Postcode Units, Postcode Sectors, Postcode Districts, and Postcode Areas. There are 1686911, 10833, 2087, and 120 resources of these types, respectively. As I understand the clarification in the comments, the idea is to associate these with Lower Layer Super Output Areas (LSOAs), e.g., Birmingham 006C. Individual Postcode Units are associated with LSOAs, but the higher-level postcode regions are not. Each Postcode Unit is within its sector, district, and area. For instance, TA21 9HB is within the sector TA21 9, the district TA21, and the area TA. Using this information, we can ask for postcode units and their corresponding district (or sector, or area), as well as their LSOA, and report just the district and the LSOA, ignoring the unit itself. For instance:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT DISTINCT ?district ?lsoa 
WHERE { 
  ?postcodeunit a pc:PostcodeUnit ;
                geo:lsoa ?lsoa ;
                sr:within ?district .
  ?district a pc:PostcodeDistrict .
}
LIMIT 10 
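The unit/sector/district/area containment used in this query can also be read off the postcode string itself, which may be handy when post-processing results in a spreadsheet. The sketch below is only an illustration in Python. It assumes the standard UK layout, in which the last three characters (a digit and two letters) form the inward code, and the function name postcode_parts is hypothetical. The sr:within links in the dataset remain the authoritative source.

```python
import re

# Area letters, rest of the district, sector digit, final two letters.
# Assumption: the standard UK postcode layout, with or without a space.
_UNIT = re.compile(r"^([A-Z]{1,2})([0-9][0-9A-Z]?)([0-9])([A-Z]{2})$")


def postcode_parts(unit):
    """Split a postcode unit such as 'TA219HB' into area, district, sector."""
    match = _UNIT.match(unit.replace(" ", "").upper())
    if match is None:
        raise ValueError(f"not a recognisable postcode unit: {unit!r}")
    area, district_suffix, sector_digit, _ = match.groups()
    district = area + district_suffix
    return {
        "area": area,
        "district": district,
        "sector": f"{district} {sector_digit}",
    }


print(postcode_parts("TA219HB"))
# prints: {'area': 'TA', 'district': 'TA21', 'sector': 'TA21 9'}
```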

Now, there are 34378 LSOAs in the dataset, so there's still a lot of data to select, and trying to pull down the text results for all distinct LSOA/district mappings still results in a timeout. In fact, since every LSOA is (I expect) associated with at least one district, there are at least as many results as there are LSOAs.

It looks like this is the point where we start to hit response size limits and timeouts on the SPARQL endpoint, and need to start accessing the data locally. The postcode data alone is 5.6 GB, though, so this isn't a wonderful solution.

But if you're willing to take a representative LSOA for each district, we can group the postcode units by district and use the SAMPLE aggregate to pick a single LSOA that some postcode unit in the district has, as in the following query. I don't know whether this is an acceptable result, but you end up with one LSOA for each district, and the result is small (at most 2087 rows, one per district), so it can be pulled down in any of the result formats (including CSV).

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT ?region (SAMPLE(?unitLsoa) AS ?lsoa)
WHERE {
  ?region a pc:PostcodeDistrict .
  ?postcodeunit a pc:PostcodeUnit ;
                geo:lsoa ?unitLsoa ;
                sr:within ?region .
}
GROUP BY ?region

There are limitations to using the SPARQL endpoint, such as the timeouts encountered above. In these situations, it's not too hard to download the data, load it into a Jena TDB store, and query it using tdbquery. The UK Postcodes page has the download link for zipped N-Triples. After downloading this data (and with Apache Jena 2.10 installed), I ran (on a Unix system):

$ tdbloader2 --loc tdb dataset_data_postcodes_20130506183000.nt

where tdb is a local directory I created to contain TDB's indexes. Loading the data takes a while (1125 seconds here), as does indexing. Once everything was loaded, I stored the earlier query that selects ?postcodeUnit and ?lsoa (without the LIMIT clause) in a file named postcodes.sparql, and ran it with

$ tdbquery --loc tdb --results CSV --query postcodes.sparql > unit_lsoa.csv

to generate results in CSV format, stored in the file unit_lsoa.csv. Here are the first few lines:

$ head -5 unit_lsoa.csv 
postcodeUnit,lsoa
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE,http://opendatacommunities.org/id/geography/lsoa/E01023667
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AG,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AJ,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AR,http://opendatacommunities.org/id/geography/lsoa/E01023684
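If installing Jena isn't an option, a plain line-by-line scan of the N-Triples file can pull out the same unit-to-LSOA pairs, since N-Triples puts one triple on each line. The following Python sketch is an alternative to the TDB approach described here, not what was actually run. It assumes the lsoa property is only used on postcode units (it does not check the subject's type), and the function name unit_lsoa_pairs is hypothetical.

```python
import re

LSOA_PREDICATE = "http://opendatacommunities.org/def/geography#lsoa"

# Matches an N-Triples line whose subject, predicate and object are all IRIs.
_IRI_TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def unit_lsoa_pairs(lines):
    """Yield (subject IRI, LSOA IRI) for every triple using the lsoa property."""
    for line in lines:
        match = _IRI_TRIPLE.match(line)
        if match and match.group(2) == LSOA_PREDICATE:
            yield match.group(1), match.group(3)


sample_lines = [
    "<http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> .",
    "<http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE> "
    "<http://opendatacommunities.org/def/geography#lsoa> "
    "<http://opendatacommunities.org/id/geography/lsoa/E01023667> .",
]

for unit, lsoa in unit_lsoa_pairs(sample_lines):
    print(unit, lsoa)

# For the real file, pass the open file object instead of a list, e.g.:
# with open("dataset_data_postcodes_20130506183000.nt", encoding="utf-8") as f:
#     for unit, lsoa in unit_lsoa_pairs(f): ...
```

Because the file is read one line at a time, memory use stays small even for a multi-gigabyte dump.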

Now, there are 1686911 postcode units defined, so I initially expected the same number of lines in unit_lsoa.csv. However, there are roughly 250,000 fewer. (wc -l prints the number of lines in a file.)

$ wc -l unit_lsoa.csv 
1440143 unit_lsoa.csv

As it turns out, some of the postcode units do not have associated LSOAs. I checked this by running the query

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit .
    FILTER NOT EXISTS { ?postcodeUnit geo:lsoa ?lsoa }
}

stored in the file postcodes_without_lsoa.sparql:

$ tdbquery --loc tdb \
    --results CSV \
    --query postcodes_without_lsoa.sparql > unit_without_lsoa.csv

Sure enough, there are about 250,000 lines in unit_without_lsoa.csv:

$ wc -l unit_without_lsoa.csv
246770 unit_without_lsoa.csv

The two files have 1440143 + 246770 = 1686913 lines in total. Subtracting the header line in each CSV file (2 lines) leaves 1686911, which is exactly the number of postcode units. Mission accomplished!
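The bookkeeping can be spelled out as a short check. This Python sketch only restates the counts reported above and assumes each CSV file has exactly one header line.

```python
total_units = 1686911        # postcode units in the dataset
lines_with_lsoa = 1440143    # wc -l unit_lsoa.csv (includes one header line)
lines_without_lsoa = 246770  # wc -l unit_without_lsoa.csv (includes one header line)

# Drop the header line from each file to get data rows.
rows_with_lsoa = lines_with_lsoa - 1
rows_without_lsoa = lines_without_lsoa - 1

# Every postcode unit should land in exactly one of the two files.
print(rows_with_lsoa + rows_without_lsoa == total_units)  # prints: True
```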
