R: Data structure for an ontology and web extraction

Problem description

I want to extract information from a large website and generate an ontology, something that can be processed with description logic.

What data structure is advisable for the extracted HTML data?

My ideas so far:
- Use Data Frames, Table Structures
- Sets and Relations (sets package and good relations)
- Graphs

In the end I want to export the data and process it with predicate logic (or description logic) using another programming language.

I want to use R to extract information from HTML pages. But as I understand it, there is no direct support in R (or its packages) for predicate logic or RDF/OWL.

So I need to do the extraction, use some data structure during the process, and then export the data.

Example Data:

SomeDocument rdf:type PDFDocument
PDFDocument rdfs:subClassOf Document
SomeDocument isUsedAt DepartmentA

DepartmentA rdf:type Department
PersonA rdf:type Person
PersonA headOf DepartmentA

PersonA hasName "John"

Where the instance data is "SomeDocument", "DepartmentA" and "PersonA".
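
As a rough illustration of the data-frame idea, statements like the ones above could be held as one row per triple and written out in N-Triples form for another tool to read. This is only a sketch in base R under stated assumptions: the `ex` base IRI, the `literal` column, and the helper names `to_iri` and `to_ntriples` are hypothetical and not part of the original question, and literal escaping is left out.

```r
# Hypothetical triple table: one row per statement (subject, predicate, object).
# The 'literal' column marks objects that are plain strings rather than resources.
triples <- data.frame(
  subject   = c("SomeDocument", "PDFDocument", "SomeDocument",
                "DepartmentA", "PersonA", "PersonA", "PersonA"),
  predicate = c("rdf:type", "rdfs:subClassOf", "isUsedAt",
                "rdf:type", "rdf:type", "headOf", "hasName"),
  object    = c("PDFDocument", "Document", "DepartmentA",
                "Department", "Person", "DepartmentA", "John"),
  literal   = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Namespaces; the 'ex' base IRI is a placeholder assumption.
prefixes <- c(
  rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  rdfs = "http://www.w3.org/2000/01/rdf-schema#",
  ex   = "http://example.org/"
)

# Expand "prefix:local" to a full IRI; bare names are placed in the 'ex' namespace.
to_iri <- function(term) {
  has_prefix <- grepl(":", term, fixed = TRUE)
  prefix <- ifelse(has_prefix, sub(":.*$", "", term), "ex")
  local  <- ifelse(has_prefix, sub("^[^:]*:", "", term), term)
  paste0("<", unname(prefixes[prefix]), local, ">")
}

# Render each row as one N-Triples line.
# Simplification: quotes and backslashes inside literals are not escaped.
to_ntriples <- function(df) {
  obj <- ifelse(df$literal,
                paste0('"', df$object, '"'),
                to_iri(df$object))
  paste(to_iri(df$subject), to_iri(df$predicate), obj, ".")
}

nt <- to_ntriples(triples)
writeLines(nt)  # use writeLines(nt, "export.nt") to write a file instead
```

A long three-column table like this stays easy to filter and append to during scraping, and the export step is the only place that needs to know about RDF syntax.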

If it makes sense, some sort of reasoning (but probably not in R):

AccessedOften(SomeDocument) => ImportantDocument(SomeDocument)

Solution

The most important question is what your website data looks like. For instance, if it already contains RDFa, you can use an RDFa distiller to get the RDF out; simple, done. You can then load the RDF into a triple store. You could augment the website's data by creating your own ontology and querying it with SPARQL; if your ontology declares classes equivalent to those in the data you found on the website, you are golden. Many triple stores can be queried as SPARQL endpoints via URLs alone and return results as XML, so even if R has no SPARQL or OWL ontology packages per se, that doesn't mean you can't query the data at all.
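
To make the "queried via URLs alone" point concrete, here is a minimal base R sketch that builds such a request URL. The endpoint address, the `ex` namespace, and the helper name `build_sparql_url` are placeholders assumed for illustration; the only protocol detail relied on is that a SPARQL endpoint accepts the query text as a URL-encoded `query` parameter of a GET request.

```r
# Placeholder endpoint; replace with the SPARQL endpoint of your triple store.
endpoint <- "http://example.org/sparql"

# Example query: list all resources typed as ex:PDFDocument (namespace assumed).
query <- "
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/>
SELECT ?doc WHERE { ?doc rdf:type ex:PDFDocument }
"

# Build a GET URL carrying the URL-encoded query text in the 'query' parameter.
build_sparql_url <- function(endpoint, query) {
  paste0(endpoint, "?query=", utils::URLencode(query, reserved = TRUE))
}

request_url <- build_sparql_url(endpoint, query)
print(request_url)

# Not run here because it needs a live endpoint:
# result_xml <- readLines(request_url, warn = FALSE)
```

The response would then be parsed with whatever XML tooling is at hand; the result format an endpoint returns by default varies between triple stores, so that part is left open here.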
