Matching strings with at least one word in common

Question

I'm making a query to get the URIs of documents, that have a specific title. My query is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?document WHERE {
  ?document dc:title ?title.
  FILTER (?title = "…" ).
}

where "…" is actually the value of this.getTitle(), since the query string is generated by:

String queryString = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> SELECT ?document WHERE { " +
                "?document dc:title ?title." +
                "FILTER (?title = \"" + this.getTitle() + "\" ). }";

With the query above, I get only the documents with titles exactly like this.getTitle(). Imagine this.getTitle is formed by more than 1 word. I'd like to get documents even if only one word forming this.getTitle appears on the document title (for example). How could I do that?

Recommended answer

Let's say you've got some data like (in Turtle):

@prefix : <http://stackoverflow.com/q/20203733/1281433> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

:a dc:title "Great Gatsby" .
:b dc:title "Boring Gatsby" .
:c dc:title "Great Expectations" .
:d dc:title "The Great Muppet Caper" .

Then you can use a query like:

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?title where {
  # this is just in place of this.getTitle().  It provides a value for
  # ?TITLE that is "Gatsby Strikes Again".
  values ?TITLE { "Gatsby Strikes Again" }

  # Select a thing and its title.
  ?x dc:title ?title .

  # Then filter based on whether the ?title matches the result
  # of replacing the spaces in ?TITLE with "|", and matching
  # case insensitively.
  filter( regex( ?title, replace( ?TITLE, " ", "|" ), "i" ))
}

to get results like:

------------------------
| x  | title           |
========================
| :b | "Boring Gatsby" |
| :a | "Great Gatsby"  |
------------------------
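
To see why only those two rows come back, here is a rough Java sketch of what the filter does: turn the spaces into "|" and do a case-insensitive search. The class and method names are made up for illustration, and Java's regex engine is only standing in for the SPARQL one, so treat it as an approximation rather than the engine's exact behavior.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the original question or answer.
class AnyWordMatchDemo {
    // Mirrors replace(?TITLE, " ", "|") followed by regex(?title, pattern, "i").
    static List<String> matching(String queryTitle, List<String> titles) {
        String alternation = queryTitle.replace(" ", "|");
        Pattern pattern = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
        return titles.stream()
                .filter(t -> pattern.matcher(t).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = List.of("Great Gatsby", "Boring Gatsby",
                "Great Expectations", "The Great Muppet Caper");
        // The pattern is "Gatsby|Strikes|Again"; only titles containing one
        // of those words survive the filter.
        System.out.println(matching("Gatsby Strikes Again", titles));
    }
}
```

Note that this is a substring search, so a query word like "Great" would also match a title containing "Greatest".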

What's particularly neat about this is that since you're generating the pattern on the fly, you could even make it based on another value from the graph pattern. For instance, if you want all pairs of things whose titles match on at least one word, you could do:

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?xtitle ?y ?ytitle where {
  ?x dc:title ?xtitle .
  ?y dc:title ?ytitle .
  filter( regex( ?xtitle, replace( ?ytitle, " ", "|" ), "i" ) && ?x != ?y )
}
order by ?x ?y

to get:

-----------------------------------------------------------------
| x  | xtitle                   | y  | ytitle                   |
=================================================================
| :a | "Great Gatsby"           | :b | "Boring Gatsby"          |
| :a | "Great Gatsby"           | :c | "Great Expectations"     |
| :a | "Great Gatsby"           | :d | "The Great Muppet Caper" |
| :b | "Boring Gatsby"          | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :d | "The Great Muppet Caper" |
| :d | "The Great Muppet Caper" | :a | "Great Gatsby"           |
| :d | "The Great Muppet Caper" | :c | "Great Expectations"     |
-----------------------------------------------------------------

Of course, it's very important to note that you're now generating patterns based on your data, which means that someone who can put data into your system could put in very expensive patterns to bog down the query and cause a denial of service. On a more mundane note, you could run into trouble if any of your titles contain characters that are special in regular expressions. One interesting problem arises if a title has multiple consecutive spaces, so that the pattern becomes The|Words|With||Two|Spaces: the empty alternative in there might make everything match. This is an interesting approach, but it's got a lot of caveats.
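
The double-space caveat is easy to reproduce outside SPARQL. The sketch below uses Java's regex engine as a stand-in (an assumption; a given SPARQL engine may treat an empty alternative differently), and the class name is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; Java regex stands in for the SPARQL regex engine.
class EmptyAlternativeDemo {
    // Same naive construction as the query: every space becomes "|".
    static boolean matches(String queryTitle, String candidate) {
        String alternation = queryTitle.replace(" ", "|");
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE)
                .matcher(candidate).find();
    }

    public static void main(String[] args) {
        // Single spaces: pattern is "With|Two|Spaces", no shared word.
        System.out.println(matches("With Two Spaces", "Great Expectations"));
        // Double space: pattern is "With||Two|Spaces"; the empty alternative
        // matches the empty string, so any candidate is accepted.
        System.out.println(matches("With  Two Spaces", "Great Expectations"));
    }
}
```

In Java the first call finds nothing and the second one matches, even though the two titles share no word.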

In general, you could do this as shown here, or by generating the regular expression in code (where you can take care of escaping, etc.), or you could use a SPARQL engine that supports some text-based extensions (e.g., jena-text, which adds Apache Lucene or Apache Solr to Apache Jena).
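
For the "generate the regular expression in code" option, a minimal sketch might look like the following. Everything here is illustrative: the class and method names are hypothetical, and the set of escaped characters is my assumption based on the XPath/XQuery regex syntax that SPARQL's regex uses, so check it against your engine before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not from the original answer.
class TitleRegexBuilder {
    // Characters assumed to be special in the engine's regex syntax.
    private static final String META = "\\.?*+{}()[]|^$-";

    // Put a backslash in front of every special character in one word.
    static String escapeWord(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (META.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    // Split on runs of whitespace so double spaces cannot create an
    // empty alternative, then join the escaped words with "|".
    static String buildPattern(String title) {
        List<String> parts = new ArrayList<>();
        for (String word : title.trim().split("\\s+")) {
            if (!word.isEmpty()) parts.add(escapeWord(word));
        }
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("title contains no words");
        }
        return String.join("|", parts);
    }

    // Quote a value as a SPARQL string literal (backslashes and quotes).
    static String toSparqlLiteral(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String buildQuery(String title) {
        return "PREFIX dc: <http://purl.org/dc/elements/1.1/> "
             + "SELECT ?document WHERE { "
             + "?document dc:title ?title . "
             + "FILTER regex(?title, " + toSparqlLiteral(buildPattern(title)) + ", \"i\") }";
    }
}
```

With this, something like `TitleRegexBuilder.buildQuery(this.getTitle())` would replace the string concatenation from the question. It still matches substrings rather than whole words, and it does nothing about very long titles, so the denial-of-service concern is reduced but not gone.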
