Speed up regex string search in MongoDB

Question

I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments. This is what a single lexeme looks like:

{
    "_id" : ObjectId("51ecff7ee36f2317c9000000"),
    "pos" : "N",
    "lemma" : "skrun",
    "gloss" : "screw",
    "wordforms" : [ 
        {
            "number" : "sg",
            "surface_form" : "skrun",
            "phonetic" : "ˈskruːn",
            "gender" : "m"
        }, 
        {
            "number" : "pl",
            "surface_form" : "skrejjen",
            "phonetic" : "'skrɛjjɛn",
            "pattern" : "CCCVCCVC"
        }
    ],
    "source" : "Mayer2013"
}

Currently I have a collection of some 4000 lexemes, and each of these has on average a list of some 1000 wordforms (as opposed to just 2 above). This means I effectively have 4,000,000 unique wordforms in the collection, and I need to be able to search through them in a reasonable amount of time.

A normal query would look like this:

db.lexemes.find({"wordforms.surface_form":"skrejjen"})

I have an index on wordforms.surface_form, and this search is very fast. However, if I want to have wildcards in my search, the performance is abysmal. For example:

db.lexemes.find({"wordforms.surface_form":/skrej/})

takes over 5 minutes (at which point I gave up waiting). As mentioned in this question, regex-searching on indexes is known to be bad. I know that adding the ^ anchor in regex searches helps a lot, but it also severely limits my search capabilities. Even if I am willing to make that sacrifice, I've noticed the response times can still vary a lot depending on the regex. The query

db.lexemes.find({"wordforms.surface_form":/^s/})

takes 35s to complete.

The best results I've had so far have in fact been when I turn off the index using hint. In this case, things seem to improve considerably. This query:

db.lexemes.find({"wordforms.surface_form":/skrej/}).hint('_id_')

takes around 3s to complete.

My question is, is there anything else I can do to improve these search times? As they are, they are still a little slow, and I am already considering migrating to MySQL in the hope of getting better performance. But I would really like to keep Mongo's flexibility and avoid all the tedious normalisation in an RDBMS. Any suggestions? Do you think I will run into some slowness regardless of DB engine, with this amount of text data?

I know about Mongo's new text search feature, but its advantages (tokenisation and stemming) are not relevant in my case (not to mention that my language is not supported). It isn't clear whether text search is actually faster than using regexes anyway.

Recommended answer

As suggested by Derick, I refactored the data in my database such that I have "wordforms" as a collection rather than as subdocuments under "lexemes". The results were in fact better! Here are some speed comparisons. The last example using hint is intentionally bypassing the indexes on surface_form, which in the old schema was actually faster.
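The refactoring itself can be sketched in plain JavaScript. This is only a minimal sketch, not Derick's suggestion verbatim: the function name `flattenLexeme` and the back-reference field `lexeme_id` are assumptions introduced here for illustration, while the other field names come from the lexeme document in the question.

```javascript
// Turn one lexeme document (old schema) into an array of standalone
// wordform documents (new schema). `lexeme_id` is a hypothetical field
// name used here to point back at the parent lexeme.
function flattenLexeme(lexeme) {
  return (lexeme.wordforms || []).map(function (wf) {
    // Copy the wordform's own fields, then add the back-reference.
    return Object.assign({}, wf, { lexeme_id: lexeme._id });
  });
}

// Example with the lexeme from the question. The _id is written as a plain
// string so that the snippet also runs outside the mongo shell.
var lexeme = {
  _id: "51ecff7ee36f2317c9000000",
  pos: "N",
  lemma: "skrun",
  gloss: "screw",
  wordforms: [
    { number: "sg", surface_form: "skrun", phonetic: "ˈskruːn", gender: "m" },
    { number: "pl", surface_form: "skrejjen", phonetic: "'skrɛjjɛn", pattern: "CCCVCCVC" }
  ],
  source: "Mayer2013"
};

var wordformDocs = flattenLexeme(lexeme);
console.log(JSON.stringify(wordformDocs, null, 2));
```

The resulting documents would then go into a separate wordforms collection with an index on surface_form, which is the layout the "new schema" queries below run against.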

Old schema (see original question)

Query                                                              Avg. Time
db.lexemes.find({"wordforms.surface_form":"skrun"})                0s
db.lexemes.find({"wordforms.surface_form":/^skr/})                 1.0s
db.lexemes.find({"wordforms.surface_form":/skru/})                 > 3mins !
db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')    2.8s

New schema (see Derick's answer)

Query                                                              Avg. Time
db.wordforms.find({"surface_form":"skrun"})                        0s
db.wordforms.find({"surface_form":/^skr/})                         0.001s
db.wordforms.find({"surface_form":/skru/})                         1.4s
db.wordforms.find({"surface_form":/skru/}).hint('_id_')            3.0s
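The gap between /^skr/ and /skru/ is consistent with how MongoDB uses an index for regexes: a case-sensitive pattern that starts with a literal prefix can be turned into a range over the index, whereas an unanchored pattern has to be tested against every key. The sketch below only imitates this idea on a sorted array standing in for the index. The names `lowerBound`, `prefixSearch` and `substringSearch` are made up for this illustration and are not MongoDB functions, and most of the sample strings are invented.

```javascript
// A sorted array of strings stands in for the index on surface_form.
// Apart from "skrun" and "skrejjen", these are made-up sample strings.
var keys = ["askrun", "bieb", "skrejjen", "skrun", "skruna", "tifel", "xskrux"].sort();

// Smallest position i such that keys[i] >= target (binary search).
function lowerBound(sortedKeys, target) {
  var lo = 0, hi = sortedKeys.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (sortedKeys[mid] < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Anchored search (like /^skr/): jump to the first key >= prefix, then walk
// forward only while the keys still start with the prefix.
function prefixSearch(sortedKeys, prefix) {
  var hits = [], examined = 0;
  for (var i = lowerBound(sortedKeys, prefix); i < sortedKeys.length; i++) {
    examined++;
    if (sortedKeys[i].indexOf(prefix) !== 0) break;
    hits.push(sortedKeys[i]);
  }
  return { hits: hits, examined: examined };
}

// Unanchored search (like /skru/): every key has to be examined.
function substringSearch(sortedKeys, needle) {
  var hits = [], examined = 0;
  sortedKeys.forEach(function (k) {
    examined++;
    if (k.indexOf(needle) !== -1) hits.push(k);
  });
  return { hits: hits, examined: examined };
}

var anchored = prefixSearch(keys, "skr");
var unanchored = substringSearch(keys, "skru");
console.log(anchored.hits, "examined:", anchored.examined);
console.log(unanchored.hits, "examined:", unanchored.examined);
```

The anchored search touches only the matching keys plus one, while the unanchored search touches all of them. With 4,000,000 wordforms that difference would plausibly account for the 0.001s versus 1.4s figures, though the real index is a B-tree and this is only an analogy.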

For me, this is pretty good evidence that the refactored schema makes searching faster and is worth the redundant data (or the extra join required).
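The "extra join" amounts to an application-side lookup: find the matching wordforms, collect their parent ids, and fetch those lexemes with a second query. A rough sketch, assuming each wordform document carries a `lexeme_id` field pointing at its lexeme (that field name is an assumption, not something stated above) and that `db` is the mongo shell's database object, where `find()` returns a synchronous cursor with `toArray()`. The function name `findLexemesByWordform` is likewise hypothetical.

```javascript
// Find the lexemes that have at least one wordform matching `pattern`.
// Assumes the mongo shell (synchronous cursors) and a hypothetical
// `lexeme_id` field on every wordform document.
function findLexemesByWordform(db, pattern) {
  // 1. Search the flat wordforms collection; project only the parent id.
  var matches = db.wordforms
    .find({ surface_form: pattern }, { lexeme_id: 1 })
    .toArray();

  // 2. De-duplicate the parent ids, since many wordforms share a lexeme.
  var seen = {};
  var ids = [];
  matches.forEach(function (wf) {
    var key = String(wf.lexeme_id);
    if (!seen[key]) {
      seen[key] = true;
      ids.push(wf.lexeme_id);
    }
  });

  // 3. Fetch the parent lexemes in a single query on _id.
  return db.lexemes.find({ _id: { $in: ids } }).toArray();
}
```

In the shell this would be called as findLexemesByWordform(db, /^skr/). The second query goes through the default _id index, so most of the cost should stay in the wordform search itself.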
