How to get documents that contain sub-string in FaunaDB

Question

I'm trying to retrieve all the task documents that have the string 'first' in their name.

I currently have the following code, but it only works if I pass the exact name:

res, err := db.client.Query(
    f.Map(
        f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
        f.Lambda("ref", f.Get(f.Var("ref"))),
    ),
)

I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.

Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up the pages.

Answer

FaunaDB provides a lot of constructs; this makes it powerful, but you have a lot to choose from. With great power comes a small learning curve :).

To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:

const faunadb = require('faunadb')
const q = faunadb.query
const {
  Not,
  Abort,
  ...
} = q

You do have to be careful exporting Map like that, since it will conflict with JavaScript's map. In that case, you could just use q.Map.

According to the documentation:

ContainsStr('Fauna', 'a')  // returns true, since 'Fauna' contains 'a'

Of course, this works on a specific value, so in order to make it work you need Filter, and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:

q.Map(
  Paginate(Documents(Collection('tasks'))),
  Lambda(['ref'], Get(Var('ref')))
)

But we can do this more efficiently since one get === one read, and we don't need the docs; we'll be filtering out a lot of them. It's interesting to know that one index page is also one read, so we can define an index as follows:

{
  name: "tasks_name_and_ref",
  unique: false,
  serialized: true,
  source: "tasks",
  terms: [],
  values: [
    {
      field: ["data", "name"]
    },
    {
      field: ["ref"]
    }
  ]
}
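If you prefer to create that index from the driver instead of pasting JSON into the dashboard, a sketch of the same definition expressed with CreateIndex would be:

CreateIndex({
  name: 'tasks_name_and_ref',
  unique: false,
  serialized: true,
  source: Collection('tasks'),
  values: [
    { field: ['data', 'name'] },
    { field: ['ref'] }
  ]
})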

And since we added name and ref to the values, the index will return pages of name/ref tuples which we can then use to filter. We can, for example, map over the index, and this will return us an array of booleans:

Map(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)

Since Filter also works on arrays, we can actually simply replace Map with Filter. We'll also add a LowerCase to ignore casing, and we have what we need:

Filter(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)

For me, the result is:


{
  "data": [
    [
      "Firstly, we'll have to go and refactor this!",
      Ref(Collection("tasks"), "267120709035098631")
    ],
    [
      "go to a big rock-concert abroad, but let's not dive in headfirst",
      Ref(Collection("tasks"), "267120846106001926")
    ],
    [
      "The first thing to do is dance!",
      Ref(Collection("tasks"), "267120677201379847")
    ]
  ]
}
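If you need the full documents rather than just name/ref tuples, you can map a Get over the filtered page (a sketch using the same index; note that each Get costs one extra read):

q.Map(
  Filter(
    Paginate(Match(Index('tasks_name_and_ref'))),
    Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
  ),
  Lambda(['name', 'ref'], Get(Var('ref')))
)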

Filter and reduced page sizes

As you mentioned, this is not exactly what you want since it also means that if you request pages of 500 in size, they might be filtered out and you might end up with a page of size 3, then one of 7. You might think, why can't I just get my filtered elements in pages? Well, it's a good idea not to do that for performance reasons, since it basically checks each value. Imagine you have a massive collection and filter out 99.99 percent. You might have to loop over many elements to get to 500, and all of those elements cost reads. We want pricing to be predictable :).
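To make the problem concrete, here is a sketch: even though we ask for pages of 500, Filter runs after Paginate, so the page that comes back can be much smaller:

Filter(
  Paginate(Match(Index('tasks_name_and_ref')), { size: 500 }),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
// Paginate returns up to 500 name/ref tuples, but Filter may leave
// only a handful of them in the page you actually receive.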

Each time you want to do something more efficiently, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies, but you'll have to be a bit creative, and I'm here to help you with that :).

In index bindings, you can transform the attributes of your document. In our first attempt, we will split the string into words (I'll implement multiple strategies since I'm not entirely sure which kind of matching you want).

We do not have a string split function, but since FQL is easily extended, we can write one ourselves, bind it to a variable in our host language (in this case JavaScript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib

function StringSplit(string: ExprArg, delimiter = " ") {
    return If(
        Not(IsString(string)),
        Abort("SplitString only accepts strings"),
        q.Map(
            // FindStrRegex returns every match of the regex, i.e. each
            // run of characters that is not the delimiter.
            FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
            // Each match is an object; its 'data' field holds the matched text.
            Lambda("res", LowerCase(Select(["data"], Var("res"))))
        )
    )
}
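A quick way to sanity-check the helper (a sketch; 'client' is assumed to be a configured faunadb.Client):

client.query(StringSplit('My first task'))
  .then(words => console.log(words))  // ['my', 'first', 'task']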

And use it in our binding.

CreateIndex({
  name: 'tasks_by_words',
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'words'
    }
  ]
})

Hint: if you are not sure whether you have got it right, you can always throw the binding in values instead of terms, and then you'll see in the Fauna dashboard whether your index actually contains values.
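As a sketch (the index name here is hypothetical, purely for inspection):

CreateIndex({
  name: 'tasks_by_words_debug',  // hypothetical name, only to inspect the binding
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  values: [
    {
      binding: 'words'
    }
  ]
})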

What did we do? We just wrote a binding that transforms the value into an array of values at the time a document is written. When you index an array of a document in FaunaDB, these values are indexed separately yet all point to the same document, which will be very useful for our search implementation.

We can now find tasks that contain the string 'first' as one of their words by using the following query:

q.Map(
  Paginate(Match(Index('tasks_by_words'), 'first')),
  Lambda('ref', Get(Var('ref')))
)

Which will give me the document with name: "The first thing to do is dance!"

The other two documents didn't contain the exact word, so how do we do that?

To get exact contains matching efficiently, you need to use a function called 'NGram' (still undocumented, since we'll make it easier in the future). Dividing a string into ngrams is a search technique that is often used under the hood in other search engines. In FaunaDB, we can easily apply it thanks to the power of indexes and bindings. The Fwitter example has an example in its source code that does autocompletion. That example won't work for your use-case, but I do reference it for other users since it's meant for autocompleting short strings, not for searching a short string within a longer string like a task.

We'll adapt it for your use-case, though. When it comes to searching, it's all a tradeoff between performance and storage, and in FaunaDB users can choose their tradeoff. Note that in the previous approach we stored each word separately; with ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).

What NGram essentially does is get substrings of a string of a certain length. For example:

NGram('lalala', 3, 3)

Will return the overlapping trigrams of the string:

['lal', 'ala', 'lal', 'ala']

If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff: increasing the size will increase the storage requirements but allow you to query for longer strings), you can write the following ngram generator.

function GenerateNgrams(Phrase) {
  return Distinct(
    Union(
      Let(
        {
          // Reduce this array if you want fewer ngrams per word.
          indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          indexesFiltered: Filter(
            Var('indexes'),
            // filter out sizes of 0 or below, since an ngram of size 0 is invalid
            Lambda('l', GT(Var('l'), 0))
          ),
          // 'Phrase' is the host-language (JavaScript) parameter, so we use it
          // directly; Var('Phrase') would reference an unbound FQL variable.
          ngramsArray: q.Map(
            Var('indexesFiltered'),
            Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l')))
          )
        },
        Var('ngramsArray')
      )
    )
  )
}
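Since GenerateNgrams just builds an FQL expression, you can run it directly from the driver to inspect what it produces before wiring it into an index (a sketch; 'client' is assumed to be a configured faunadb.Client):

// Returns all distinct ngrams of sizes 1 through 9 of the input,
// e.g. 'f', 'fi', 'fir', 'firs', 'first', ... for 'first'.
client.query(GenerateNgrams('first'))
  .then(ngrams => console.log(ngrams))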

You can then write your index as follows:

CreateIndex({
  name: 'tasks_by_ngrams_exact',
  // we actually want to sort to get the shortest word that matches first
  source: [
    {
      // If your collections have the same property that you want to access, you can pass a list to collection
      collection: [Collection('tasks')],
      fields: {
        wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'wordparts'
    }
  ]
})

And you have an index-backed search where your pages are the size you requested.

q.Map(
  Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
  Lambda('ref', Get(Var('ref')))
)
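For example (a sketch), requesting pages of 500 now actually yields up to 500 matching documents per page, since the index itself does the matching:

q.Map(
  Paginate(Match(Index('tasks_by_ngrams_exact'), 'first'), { size: 500 }),
  Lambda('ref', Get(Var('ref')))
)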

Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)

If you want fuzzy searching, trigrams are often used; in this case our index will be easy, so we're not going to use an external function.

CreateIndex({
  name: 'tasks_by_ngrams',
  source: {
    collection: Collection('tasks'),
    fields: {
      ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
    }
  },
  terms: [
    {
      binding: 'ngrams'
    }
  ]
})

If we place the binding in values again to see what comes out, we'll see the ngrams that were generated for each document. In this approach, we use trigrams both on the indexing side and on the querying side. On the querying side, that means that the word 'first' which we search for will also be divided into trigrams: 'fir', 'irs', and 'rst'.

For example, we can now do a fuzzy search as follows:

q.Map(
  Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)

In this case, we actually do three searches: we search for all of the trigrams and take the union of the results, which returns all sentences that contain 'first'.

But if we had misspelled it and written frst, we would still match all three documents since there is a trigram (rst) that matches.
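A sketch of the same query with the misspelled term, to illustrate:

q.Map(
  Paginate(Union(q.Map(NGram('frst', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)
// NGram('frst', 3, 3) yields ['frs', 'rst']; 'rst' also appears in the
// trigrams of 'first', so the same documents are still found.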
