Speed up regex string search in MongoDB


Problem description


I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments. This is what a single lexeme looks like:

{
    "_id" : ObjectId("51ecff7ee36f2317c9000000"),
    "pos" : "N",
    "lemma" : "skrun",
    "gloss" : "screw",
    "wordforms" : [ 
        {
            "number" : "sg",
            "surface_form" : "skrun",
            "phonetic" : "ˈskruːn",
            "gender" : "m"
        }, 
        {
            "number" : "pl",
            "surface_form" : "skrejjen",
            "phonetic" : "'skrɛjjɛn",
            "pattern" : "CCCVCCVC"
        }
    ],
    "source" : "Mayer2013"
}

Currently I have a collection of some 4000 lexemes, and each of these has on average a list of some 1000 wordforms (as opposed to just 2 above). This means I effectively have 4,000,000 unique word forms in the collection, and I need to be able to search through them in a reasonable amount of time.

A normal query would look like this:

db.lexemes.find({"wordforms.surface_form":"skrejjen"})

I have an index on wordforms.surface_form, and this search is very fast. However, if I want to have wildcards in my search, the performance is abysmal. For example:

db.lexemes.find({"wordforms.surface_form":/skrej/})

takes over 5 minutes (at which point I gave up waiting). As mentioned in this question, regex-searching on indexes is known to be bad. I know that adding the ^ anchor in regex searches helps a lot, but it also severely limits my search capabilities. Even if I am willing to make that sacrifice, I've noticed the response times can still vary a lot depending on the regex. The query

db.lexemes.find({"wordforms.surface_form":/^s/})

takes 35s to complete.
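One way to see why the `^` anchor matters so much: a left-anchored regex is equivalent to a range scan over the sorted index keys, while an unanchored regex forces every key in the index to be tested. Below is a rough sketch of the two access patterns in plain JavaScript (no MongoDB required; the sorted array and the key values are hypothetical stand-ins for the index on surface_form):

```javascript
// A sorted key array standing in for a B-tree index on surface_form.
const keys = ["skond", "skrejjen", "skru", "skrun", "skruna", "vara"];

// Anchored /^skr/: binary-search to the first key >= the prefix, then
// walk forward while the prefix still matches -- only matching keys
// (plus one boundary check) are ever touched.
function prefixScan(sortedKeys, prefix) {
  let lo = 0, hi = sortedKeys.length;
  while (lo < hi) {                        // lower-bound binary search
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] < prefix) lo = mid + 1; else hi = mid;
  }
  const out = [];
  while (lo < sortedKeys.length && sortedKeys[lo].startsWith(prefix)) {
    out.push(sortedKeys[lo++]);
  }
  return out;
}

// Unanchored /skru/: there is no usable starting point in the sort
// order, so every key must be regex-tested -- a full index scan.
function fullIndexScan(sortedKeys, re) {
  return sortedKeys.filter(k => re.test(k));
}

console.log(prefixScan(keys, "skr"));     // only the "skr..." run is visited
console.log(fullIndexScan(keys, /skru/)); // all six keys were tested
```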

The best results I've had so far have in fact been when I turn off the index using hint. In this case, things seem to improve considerably. This query:

db.lexemes.find({"wordforms.surface_form":/skrej/}).hint('_id_')

takes around 3s to complete.

My question is: is there anything else I can do to improve these search times? As they are, they are still a little slow, and I am already considering migrating to MySQL in the hopes of getting better performance. But I would really like to keep Mongo's flexibility and avoid all the tedious normalisation in an RDBMS. Any suggestions? Do you think I will run into some slowness regardless of DB engine, with this amount of text data?

I know about Mongo's new text search feature, but the advantages of this (tokenisation and stemming) are not relevant in my case (not to mention my language is not supported). It isn't clear if text search is actually faster than using regexes anyway.

Solution

As suggested by Derick, I refactored the data in my database such that I have "wordforms" as a collection rather than as subdocuments under "lexemes". The results were in fact better! Here are some speed comparisons. The last example using hint is intentionally bypassing the indexes on surface_form, which in the old schema was actually faster.
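The refactor itself is a straightforward flattening: each embedded wordform becomes its own document carrying a reference back to its parent lexeme. A minimal sketch of that transformation in plain JavaScript (the `lexeme_id` field name is my own choice for the back-reference, not something prescribed by the answer):

```javascript
// Sketch: flatten lexemes-with-embedded-wordforms into standalone
// wordform documents suitable for a dedicated `wordforms` collection.
function flattenLexemes(lexemes) {
  const wordforms = [];
  for (const lex of lexemes) {
    for (const wf of lex.wordforms || []) {
      // Copy the wordform's own fields and attach the parent's _id.
      wordforms.push({ ...wf, lexeme_id: lex._id });
    }
  }
  return wordforms;
}

const lexemes = [{
  _id: "51ecff7ee36f2317c9000000",
  lemma: "skrun",
  wordforms: [
    { number: "sg", surface_form: "skrun" },
    { number: "pl", surface_form: "skrejjen" },
  ],
}];

console.log(flattenLexemes(lexemes));
```

Each resulting document can then be inserted into its own collection and indexed directly on surface_form, so the index is no longer a multikey index over embedded arrays.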

Old schema (see original question)

Query                                                              Avg. Time
db.lexemes.find({"wordforms.surface_form":"skrun"})                0s
db.lexemes.find({"wordforms.surface_form":/^skr/})                 1.0s
db.lexemes.find({"wordforms.surface_form":/skru/})                 > 3mins !
db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')    2.8s

New schema (see Derick's answer)

Query                                                              Avg. Time
db.wordforms.find({"surface_form":"skrun"})                        0s
db.wordforms.find({"surface_form":/^skr/})                         0.001s
db.wordforms.find({"surface_form":/skru/})                         1.4s
db.wordforms.find({"surface_form":/skru/}).hint('_id_')            3.0s

For me this is pretty good evidence that the refactored schema makes searching faster, and is worth the redundant data (or the extra join required).
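The "extra join" mentioned above is the price of the flat schema: once a wordform matches, retrieving its lexeme's data takes a second lookup on the application side. A sketch of that pattern with plain objects (the `lexeme_id` back-reference field is hypothetical, as is the in-memory `Map` standing in for the lexemes collection):

```javascript
// Sketch: the application-side "join" the flat schema requires.
const lexemesById = new Map([
  ["lex1", { _id: "lex1", lemma: "skrun", gloss: "screw" }],
]);
const wordforms = [
  { surface_form: "skrun",    lexeme_id: "lex1" },
  { surface_form: "skrejjen", lexeme_id: "lex1" },
];

// Find matching wordforms, then fetch each one's parent lexeme.
function search(re) {
  return wordforms
    .filter(wf => re.test(wf.surface_form))
    .map(wf => ({ ...wf, lexeme: lexemesById.get(wf.lexeme_id) }));
}

console.log(search(/^skrej/)); // one match, enriched with its lexeme
```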
