How do you find the list of Wikidata (or Freebase or DBpedia) topics that a text is about?


Question

I am looking for a solution to extract the list of concepts that a text (or HTML) document is about. I'd like the concepts to be Wikidata topics (or Freebase or DBpedia).

For example, "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, Wikidata Q2831) and Bad (the song, Wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).

Ideally the system should work across multiple languages, handle both short and long texts, and return multiple topics when it is unsure (e.g. Bad the song + Bad the album). It should also ideally be open source and have a Python API.

Yes, that sounds like a wish list for Santa Claus. Any ideas?

Edit

I checked out a few solutions, but found no silver bullet so far.

  • NLTK parses text and extracts "named entities" (AFAIU, the parts of a sentence that refer to a name), but it returns plain text rather than Wikidata topics. This means it will likely not understand that "I shot the sheriff" is the name of a song by Bob Marley; it will treat it as a sentence instead.
  • OpenNLP does roughly the same.
  • Wikidata has a search API, but it only takes one term at a time and does not handle disambiguation.
  • There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...), but none really shines, IMHO.

Recommended answer

You can use spaCy to retrieve named entities, then link them to Wikidata using the search API.
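A minimal sketch of that pipeline follows. It assumes the `en_core_web_sm` spaCy model is installed and uses the `wbsearchentities` action of the Wikidata API; the helper names and the User-Agent string are made up for illustration. The search returns several candidates per term, so picking the right one is still left to the caller.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def named_entities(text, model="en_core_web_sm"):
    """Return the surface strings spaCy tags as named entities."""
    import spacy  # imported lazily so the helpers below work without spaCy

    nlp = spacy.load(model)  # assumption: this model has been downloaded
    return [ent.text for ent in nlp(text).ents]


def search_url(term, language="en", limit=5):
    """Build a wbsearchentities request URL for a single term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def parse_candidates(payload):
    """Extract (id, label, description) tuples from a search response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def search_wikidata(term, language="en", limit=5):
    """Query Wikidata over the network and return candidate topics."""
    request = urllib.request.Request(
        search_url(term, language, limit),
        headers={"User-Agent": "entity-linking-sketch/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_candidates(json.load(response))
```

Something like `[search_wikidata(name) for name in named_entities(text)]` would then give one candidate list per entity found by spaCy.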

For the rest of the sentence, which spaCy does not match as a named entity, you can build a list of n-grams and look each one up with the Wikidata search API, starting with the longest n-gram.
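The longest-first n-gram lookup could be sketched as below. The function names are hypothetical, and `lookup` stands for any callable that returns a (possibly empty) list of candidates for a phrase, for example a wrapper around the Wikidata search API. Skipping spans that overlap an earlier match is my own assumption about how to avoid linking the same words twice.

```python
def ngram_spans(tokens, max_n=None):
    """Yield (start, end) token spans, longest n-grams first."""
    n_max = len(tokens) if max_n is None else min(max_n, len(tokens))
    for n in range(n_max, 0, -1):
        for start in range(len(tokens) - n + 1):
            yield start, start + n


def link_ngrams(tokens, lookup, max_n=None):
    """Look up n-grams longest first, skipping spans already matched."""
    covered = set()  # token positions that already belong to a match
    matches = []
    for start, end in ngram_spans(tokens, max_n):
        if covered.intersection(range(start, end)):
            continue
        phrase = " ".join(tokens[start:end])
        hits = lookup(phrase)
        if hits:
            matches.append((phrase, hits))
            covered.update(range(start, end))
    return matches
```

Each n-gram costs one lookup, so on long texts it is probably worth capping `max_n` at a handful of tokens.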

POS tagging can be put to good use. That said, syntactic parse information is more powerful, since it tells you the relations between the words. For instance, given the following output from link-grammar:

Found 8 linkages (8 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)

    +-------------------------Xp-------------------------+
    +----------->WV---------->+                          |
    +-------Wd------+         +---------Osn--------+     |
    |       +---G---+----Ss---+----Os----+         |     |
    |       |       |         |          |         |     |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] . 

You can tell that the subject is "Bob Marley" because:

  1. "wrote" is connected to "Marley" with an S link, which connects subject nouns to finite verbs.
  2. "Marley" is connected to "Bob" with a G link, which connects proper nouns together.

So "Bob Marley" is a good candidate for an entity (both words are also capitalized).
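The reasoning above can be sketched in code. The links below were transcribed by hand from the Linkage 1 diagram as (left word, label, right word) tuples; this is not the output format of any link-grammar binding, and the function names are made up for illustration.

```python
# Links transcribed by hand from Linkage 1: (left word, label, right word).
LINKAGE_1 = [
    ("LEFT-WALL", "Xp", "."),
    ("LEFT-WALL", "WV", "wrote"),
    ("LEFT-WALL", "Wd", "Marley"),
    ("Bob", "G", "Marley"),
    ("Marley", "Ss", "wrote"),
    ("wrote", "Os", "Natural"),
    ("wrote", "Osn", "Mystic"),
]


def link_type(label):
    """Drop the lowercase subscripts, e.g. 'Ss' -> 'S', 'Osn' -> 'O'."""
    return "".join(ch for ch in label if ch.isupper())


def subject_phrase(links):
    """Return the subject noun, extended leftwards through G links."""
    # An S link joins a subject noun (left) to its finite verb (right).
    subject = next(
        (left for left, label, _ in links if link_type(label) == "S"), None
    )
    if subject is None:
        return None
    words = [subject]
    # Follow G links right-to-left to collect the rest of the proper noun.
    # Words are matched by string, so a repeated word would confuse this.
    while True:
        previous = next(
            (left for left, label, right in links
             if label == "G" and right == words[0]),
            None,
        )
        if previous is None:
            break
        words.insert(0, previous)
    return " ".join(words)
```

On `LINKAGE_1` this picks "Marley" through the Ss link and then prepends "Bob" through the G link.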

Given the above parse "tree", it is difficult to tell whether "Natural" and "Mystic" are related, even though they are on the same side of the sentence.

The second parse provided by link-grammar has the same cost vector and links "Natural" and "Mystic" together, again with a G link.

Here it is:

    Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)

    +-------------------------Xp-------------------------+
    +----------->WV---------->+                          |
    +-------Wd------+         +---------Os---------+     |
    |       +---G---+----Ss---+          +----G----+     |
    |       |       |         |          |         |     |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .

So in my opinion "Bob Marley" and "Natural Mystic" are good candidates for a Wikidata search.
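As a rough illustration, adjacent words joined by a G link can be merged into search candidates. The word list and G links below are transcribed by hand from Linkage 2, and `proper_noun_candidates` is a hypothetical helper rather than part of link-grammar.

```python
SENTENCE = ["Bob", "Marley", "wrote", "Natural", "Mystic"]
# G links transcribed by hand from Linkage 2: (left word, right word).
LINKAGE_2_G_LINKS = [("Bob", "Marley"), ("Natural", "Mystic")]


def proper_noun_candidates(words, g_links):
    """Merge adjacent words joined by a G link into candidate phrases."""
    linked = set(g_links)
    candidates = []
    current = []  # run of words chained together by G links so far
    for word in words:
        if current and (current[-1], word) in linked:
            current.append(word)
        else:
            if len(current) > 1:
                candidates.append(" ".join(current))
            current = [word]
    if len(current) > 1:
        candidates.append(" ".join(current))
    return candidates
```

With the links from Linkage 1, which only has the G link between "Bob" and "Marley", the same function would return "Bob Marley" alone. When several linkages share the same cost, it may be worth collecting candidates from each of them.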

That was the easy case, where grammar and spelling are correct.

Here is one parse out of 11 of the same sentence in lower case:

Linkage 1, cost vector = (UNUSED=1 DIS= 0.15 LEN=14)

    +------------------------Xp------------------------+
    +----------------------Wa---------------------+    |
    |       +------------------AN-----------------+    |
    |       |        +-------------AN-------------+    |
    |       |        |                  +----AN---+    |
    |       |        |                  |         |    |
LEFT-WALL Bob.m marley[?].n [wrote] natural.n mystic.n . 

LG (link-grammar) doesn't even recognize the verb.
