Query for best match to a string with SPARQL?
Problem description
I have a list of movie titles and want to look them up in DBpedia for meta information such as the director. However, I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.
How can I get the best match for a movie title from DBpedia using SPARQL?
Some problematic examples:
- My list: Die Hard: With a Vengeance vs. DBpedia: Die Hard with a Vengeance
- My list: Hachi vs. DBpedia: Hachi: A Dog's Tale
My current approach is to query the DBpedia endpoint for all movies, filter by checking for the individual tokens of the title (with punctuation removed), order by title, and return the first result. E.g.:
SELECT ?resource ?title ?director WHERE {
  ?resource foaf:name ?title .
  ?resource rdf:type schema:Movie .
  ?resource dbo:director ?director .
  FILTER (
    contains(lcase(str(?title)), "die") &&
    contains(lcase(str(?title)), "hard")
  )
}
ORDER BY (?title)
LIMIT 1
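Since the query hard-codes the tokens of a single title, it has to be regenerated for every entry in the list. A minimal sketch of how that could be done, assuming Python on the client side; `tokenize` and `build_token_filter_query` are hypothetical helper names, and the sketch only builds the query text (sending it to the endpoint is left out):

```python
import re

# Same query shape as above; the FILTER conditions are filled in per title.
QUERY_TEMPLATE = """SELECT ?resource ?title ?director WHERE {{
  ?resource foaf:name ?title .
  ?resource rdf:type schema:Movie .
  ?resource dbo:director ?director .
  FILTER (
    {conditions}
  )
}}
ORDER BY (?title)
LIMIT 1"""


def tokenize(title):
    """Lowercase the title and split it into runs of letters/digits (punctuation is dropped)."""
    return re.findall(r"[^\W_]+", title.lower())


def build_token_filter_query(title):
    """Return a query that requires every token of `title` to occur in ?title."""
    tokens = tokenize(title)
    if not tokens:
        raise ValueError("title contains no usable tokens")
    # Tokens contain only letters and digits, so they cannot break out of the string literal.
    conditions = " &&\n    ".join(
        f'contains(lcase(str(?title)), "{token}")' for token in tokens
    )
    return QUERY_TEMPLATE.format(conditions=conditions)


if __name__ == "__main__":
    print(build_token_filter_query("Die Hard"))
```

Requiring every token is stricter than the hand-written query, which uses only "die" and "hard"; a title from the list that contains a word missing on the DBpedia side would then return nothing.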
This approach is very slow and also sometimes fails, e.g.:
SELECT ?resource ?title ?director WHERE {
  ?resource foaf:name ?title .
  ?resource rdf:type schema:Movie .
  ?resource dbo:director ?director .
  FILTER (
    contains(lcase(str(?title)), "hachi")
  )
}
ORDER BY (?title)
LIMIT 10
where the correct result is in second place:
resource | title | director
http://dbpedia.org/resource/Chachi_420 | "Chachi 420"@en | http://dbpedia.org/resource/Kamal_Haasan
http://dbpedia.org/resource/Hachi:_A_Dog's_Tale | "Hachi: A Dog's Tale"@en | http://dbpedia.org/resource/Lasse_Hallström
http://dbpedia.org/resource/Hachiko_Monogatari | "Hachikō Monogatari"@en | http://dbpedia.org/resource/Seijirō_Kōyama
http://dbpedia.org/resource/Thachiledathu_Chundan | "Thachiledathu Chundan"@en | http://dbpedia.org/resource/Shajoon_Kariyal
Any ideas on how to solve this problem? Or even better: how can I query for the best match to a string with SPARQL in general?
Thanks!
Recommended answer
I adapted the regex approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:
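# ?match is the concatenation of the search tokens ("die", "hard", "with") captured from the title.
# Results are ranked by the most matched characters first, then by the shortest title.
SELECT ?resource ?title ?match (strlen(str(?title)) AS ?lenTitle) (strlen(str(?match)) AS ?lenMatch)
WHERE {
  ?resource foaf:name ?title .
  ?resource rdf:type schema:Movie .
  ?resource dbo:director ?director .
  BIND( replace(LCASE(CONCAT('x', ?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") AS ?match )
}
ORDER BY DESC(?lenMatch) ASC(?lenTitle)
LIMIT 5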
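The pattern in this query is written out by hand for the tokens "die", "hard" and "with". A rough sketch of how the same pattern could be generated for any title and tried out locally, assuming Python; `build_match_pattern` and `rank` are hypothetical helper names. Python's `re` is not the regex flavor that SPARQL's replace() uses, so the local ranking only approximates what the endpoint would return:

```python
import re


def tokenize(title):
    """Lowercase the title and split it into runs of letters/digits."""
    return re.findall(r"[^\W_]+", title.lower())


def build_match_pattern(tokens):
    """Rebuild the pattern from the query: the first token is optional right after
    the dummy 'x', every later token is optional after any amount of text."""
    if not tokens:
        raise ValueError("at least one token is required")
    first, *rest = (re.escape(token) for token in tokens)
    middle = "".join(f"(?:.*?({token}))*" for token in rest)
    return f"^x({first})*{middle}.*$"


def rank(search_title, candidates):
    """Order candidates like the query does: longest ?match first, then shortest title."""
    tokens = tokenize(search_title)
    pattern = re.compile(build_match_pattern(tokens))
    # Python equivalent of the "$1$2$3" replacement string.
    replacement = "".join(f"\\g<{i}>" for i in range(1, len(tokens) + 1))
    scored = []
    for title in candidates:
        match = pattern.sub(replacement, "x" + title.lower())
        scored.append((-len(match), len(title), title))
    return [title for _, _, title in sorted(scored)]


if __name__ == "__main__":
    print(build_match_pattern(["die", "hard", "with"]))
    print(rank("Die Hard: With a Vengeance",
               ["Live Free or Die Hard", "Die Hard 2", "Die Hard with a Vengeance", "Die Hard"]))
```

The generated pattern string could be pasted into the replace() call of the query, with the replacement string extended to one `$n` per token. Because the first token is only tried at the very start of the title, and short tokens such as "a" can match inside longer words, the score is a heuristic rather than a true similarity measure.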
It's not perfect, so I'm still open to suggestions.