Representing a couple of RDF triples using a tensor: how to program this modeling process in Python?

Problem description

A question about representing a couple of RDF triples using a tensor.

Scenario:

An RDF triple expresses a simple statement about resources, in the form (subject, predicate, object).

Suppose I have two predicates, play_for and race_for, each of which appears in n triples, as follows:

First predicate: play_for; n triples: (Ray Allen, play_for, Boston Celtics), (Kobe Bryant, play_for, Lakers), ... In short, (A_i, play_for, T_i) for i = 1 to n.

Second predicate: race_for; n triples: (Boston Celtics, race_for, NBA championship), (Lakers, race_for, NBA championship), ... In short, (T_i, race_for, NBA) for i = 1 to n.

A tensor representation is one way to model these 2n triples. I'm studying Maximilian Nickel's paper, which uses tensor factorization to find the latent semantic structure of a dataset. The first step is to represent the dataset as a tensor.

A tensor entry X_ijk = 1 denotes the fact that the relation (i-th entity, k-th predicate, j-th entity) exists. Otherwise, for non-existing and unknown relations, the entry is set to zero. For instance, these 2n triples can be modeled by a tensor as:

 First slice: (A_i, play_for, T_i)

       A1, A2,...,An, T1, T2,...,Tn, NBA
 A1    0    0      0   1   0      0    0
 A2    0    0      0   0   1      0    0
 :
 An    0    0      0   0   0      1    0
 T1    0    0      0   0   0      0    0
 T2    0    0      0   0   0      0    0
 :
 Tn    0    0      0   0   0      0    0
 NBA   0    0      0   0   0      0    0

 Second slice: (T_i, race_for, NBA)

      A1,  A2,...,An, T1, T2,...,Tn, NBA
 A1    0    0      0   0   0      0    0
 A2    0    0      0   0   0      0    0
 :
 An    0    0      0   0   0      0    0
 T1    0    0      0   0   0      0    1
 T2    0    0      0   0   0      0    1
 :
 Tn    0    0      0   0   0      0    1
 NBA   0    0      0   0   0      0    0
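
The two slices above can be written down directly as a small dense array. This is a minimal sketch, assuming NumPy is available and taking n = 3 purely for illustration; the layout X[i, j, k] follows the X_ijk convention from the question, indices are 0-based, and the names `entities`, `predicates`, and `X` are illustrative rather than taken from the original post.

```python
import numpy as np

n = 3  # small illustrative value; the question leaves n unspecified
entities = ([f"A{i}" for i in range(1, n + 1)]
            + [f"T{i}" for i in range(1, n + 1)]
            + ["NBA"])
predicates = ["play_for", "race_for"]

# X[i, j, k] = 1 means (entity i, predicate k, entity j) is a known triple
X = np.zeros((len(entities), len(entities), len(predicates)), dtype=np.int8)

for i in range(n):
    a, t, nba = i, n + i, 2 * n   # 0-based positions of A_i, T_i and NBA
    X[a, t, 0] = 1                # (A_i, play_for, T_i)
    X[t, nba, 1] = 1              # (T_i, race_for, NBA)

print(X[:, :, 0])  # play_for slice
print(X[:, :, 1])  # race_for slice
```

A dense array like this is only practical for small entity sets, since its size grows with the square of the number of entities.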

Assume the RDF triples are stored in 'test.txt'. My question is how to program this modeling process in Python.

Here is my idea:

The most difficult part is getting, for each RDF triple, the coordinates of the corresponding non-zero entry in the tensor. To start, here is a list containing all entities:

T = ['A1', ..., 'An', 'T1', ..., 'Tn', 'NBA']

For every RDF triple (Subject_i, Predicate_k, Object_j) in the dataset, there is a coordinate (i, j, k) that gives the position of X_ijk = 1 in the tensor. For instance, an existing RDF triple (A_i, play_for, T_i) might have the coordinate (i, j, k) = (5, 13, 1), which means X(5, 13) = 1 in the first slice matrix. However, I don't know how to obtain this coordinate. Should I use a dictionary to store the triples?
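
One possible way to obtain these coordinates is with two dictionaries that map each entity and each predicate to an integer index, assigned in order of first appearance. This is only a sketch under stated assumptions: each line is assumed to hold one triple as tab-separated subject, predicate, object (the question does not specify the layout of test.txt), indices are 0-based, and `triples_to_coords` is a hypothetical helper name.

```python
def triples_to_coords(lines):
    """Map 'subject<TAB>predicate<TAB>object' lines to (i, j, k) coordinates."""
    entity_index = {}      # entity name -> row/column index
    predicate_index = {}   # predicate name -> slice index
    coords = []
    for line in lines:
        line = line.strip()
        if not line:
            continue       # skip blank lines
        subj, pred, obj = line.split("\t")
        # setdefault assigns the next free index the first time a name is seen
        i = entity_index.setdefault(subj, len(entity_index))
        j = entity_index.setdefault(obj, len(entity_index))
        k = predicate_index.setdefault(pred, len(predicate_index))
        coords.append((i, j, k))
    return coords, entity_index, predicate_index


# assumed sample data, using the names from the question
sample = [
    "Ray Allen\tplay_for\tBoston Celtics",
    "Kobe Bryant\tplay_for\tLakers",
    "Boston Celtics\trace_for\tNBA championship",
    "Lakers\trace_for\tNBA championship",
]
coords, entity_index, predicate_index = triples_to_coords(sample)
print(coords)  # [(0, 1, 0), (2, 3, 0), (1, 4, 1), (3, 4, 1)]
```

Because the function accepts any iterable of lines, an open file object for 'test.txt' could be passed in place of `sample`. The resulting entity order follows first appearance in the data, not the A_1..A_n, T_1..T_n, NBA ordering used in the slices above.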

I'm not very familiar with Python. I've tried to find a solution, but I have no idea how to solve it. Any help would be greatly appreciated.

For brevity and readability, I have removed the description of RDF.

Recommended answer

Python's best library for RDF is rdflib. An rdflib graph has a method for this:

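# yields (subject, object) pairs for every race_for triple
lst = myGraph.subject_objects(MyNS.race_for)
# which is shorthand for the general pattern query below,
# except that triples() yields full (subject, predicate, object) tuples:
lst = myGraph.triples((None, MyNS.race_for, None))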

The second syntax is also found in RDF libraries for other languages, such as Jena for Java.

Within SciPy, you should use the scipy.sparse module for your sparse binary arrays.
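
As a rough illustration of this suggestion, the tensor can be held as one sparse matrix per predicate. The sketch below assumes SciPy is installed and that the (i, j, k) coordinates have already been computed; the coordinate values and the names `coords` and `slices` are made up for the example.

```python
from scipy.sparse import coo_matrix

n_entities, n_predicates = 5, 2
# hypothetical 0-based (subject, object, predicate) coordinates
coords = [(0, 1, 0), (2, 3, 0), (1, 4, 1), (3, 4, 1)]

slices = []
for k in range(n_predicates):
    rows = [i for i, j, kk in coords if kk == k]
    cols = [j for i, j, kk in coords if kk == k]
    data = [1] * len(rows)
    # each slice is an n_entities x n_entities binary adjacency matrix
    slices.append(coo_matrix((data, (rows, cols)),
                             shape=(n_entities, n_entities)).tocsr())

print(slices[0].toarray())  # dense view of the first predicate's slice
```

Only the non-zero entries are stored, so memory use scales with the number of triples rather than with the square of the number of entities.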

Look at NumPy for the best way to "factorize" the subjects and objects returned from the triples query, that is, to map each distinct entity to an integer index. It should be pretty simple. pandas also has tools for this, but my guess is that you will have large sparse matrices, so you are better off with the scipy.sparse module.
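
One reading of "factorize" here is assigning an integer code to each distinct entity name. The following is a small sketch using `numpy.unique` with `return_inverse=True`, assuming the subjects and objects are already available as Python lists (the sample names are taken from the question). Note that `numpy.unique` sorts the names, so the codes follow alphabetical order rather than order of appearance.

```python
import numpy as np

subjects = ["Ray Allen", "Kobe Bryant"]
objects = ["Boston Celtics", "Lakers"]

# factorize subjects and objects together so they share one index space
names, codes = np.unique(subjects + objects, return_inverse=True)
subject_codes = codes[:len(subjects)]
object_codes = codes[len(subjects):]

print(names)          # sorted distinct entity names
print(subject_codes)  # row indices for the subjects
print(object_codes)   # column indices for the objects
```

The two code arrays can then serve as the row and column indices of a sparse matrix for the corresponding predicate.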
