具有真正“全文搜索"功能的Sqlite;和拼写错误(一起使用FTS + spellfix) [英] Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

查看:116
本文介绍了具有真正“全文搜索"功能的Sqlite;和拼写错误(一起使用FTS + spellfix)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我们有100万行,例如:

Let's say we have 1 million of rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

背景:

我知道如何使用Sqlite:

Background:

I know how to do this with Sqlite:

  • Find a row with a single-word query, up to a few spelling mistakes with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):

db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()

#Query: 'Riehmand'
#Answer: [(1, u'Riemann')]

具有1M行,这将非常慢!如此处详细所示, postgresql可能使用trigrams对此进行了优化. Sqlite提供的一种快速解决方案是使用VIRTUAL TABLE USING spellfix:

With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization with this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:

c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()

#Query: 'Riehmand'
#Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!

  • 使用FTS(全文搜索")通过查询匹配一个或多个单词 的表达式:

  • Find an expression with a query matching one or multiple words with FTS ("Full Text Search"):

    c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
    c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
    c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()
    
    #Query: 'NUMBERS carmichael'
    #Answer: [(2, u'All the Carmichael numbers')]
    

    它不区分大小写,您甚至可以使用带有错误顺序的两个单词的查询,等等:FTS确实非常强大.但是缺点是每个查询关键字必须正确拼写,即仅FTS不允许出现拼写错误.

    It is case insensitive and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each of the query-keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.

    如何使用Sqlite 进行全文搜索(FTS)并允许拼写错误??即同时使用"FTS + spellfix"

    How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together

    示例:

    • 在数据库中的行:"All the Carmichael numbers"
    • 查询:"NUMMBER carmickaeel"应该匹配它!
    • row in the DB: "All the Carmichael numbers"
    • query: "NUMMBER carmickaeel" should match it!

    如何使用Sqlite做到这一点?

    由于此页面指出:

    或者,它[spellfix]可以与FTS4一起使用,以使用可能拼写错误的单词进行全文搜索.

    Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

    链接的问题:与Python + Sqlite(Levenshtein)的字符串相似性距离/编辑距离)

    推荐答案

    spellfix1文档实际上告诉您如何执行此操作.从 概述部分:

    The spellfix1 documentation actually tells you how to do this. From the Overview section:

    如果您打算将此虚拟表与FTS4表配合使用(用于搜索词的拼写更正),则可以使用

    SELECT term from search_aux WHERE col='*'语句提取所有索引的令牌.

    将其与示例相连,其中mytable2是fts4虚拟表,您可以创建一个fts4aux表并将这些标记插入到mytable3 spellfix1表中,

    Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create a fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

    CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
    INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';
    

    您可能希望进一步限定该查询,以跳过已经在spellfix1中插入的​​所有字词,否则最终将出现两次输入:

    You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:

    INSERT INTO mytable3(word)
        SELECT term FROM mytable2_terms
        WHERE col='*' AND 
            term not in (SELECT word from mytable3_vocab);
    

    现在,您可以使用mytable3将拼写错误的单词映射到更正的标记,然后在针对mytable2MATCH查询中使用这些更正的标记.

    Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query againsts mytable2.

    根据您的需要,这可能意味着您需要自己进行令牌处理和查询构建;没有公开的fts4查询语法解析器.因此,您需要拆分两个令牌的搜索字符串,每个令牌都通过spellfix1表运行以映射到现有令牌,然后将这些令牌提供给fts4查询.

    Depending on your neads, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.

    忽略SQL语法来处理此问题,使用Python进行拆分很容易:

    Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:

    def spellcheck_terms(conn, terms):
        cursor = conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms = terms.split()
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join([
            base_spellfix.format(i + 1) for i in range(len(params))])
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return " ".join([correction_map.get(t, t) for t in terms])
    
    def spellchecked_search(conn, terms):
        corrected_terms = spellcheck_terms(conn, terms)
        cursor = conn.cursor()
        fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
        cursor.execute(fts_query, (corrected_terms,))
        return cursor.fetchall()
    

    然后为spellchecked_search(db, "NUMMBER carmickaeel")返回[('All the Carmichael numbers',)].

    在Python中保留拼写检查处理,然后可以根据需要支持更复杂的FTS查询;您可能必须重新实现表达式解析器这样做,但是至少Python提供了执行此操作的工具.

    Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

    一个完整的示例,将上述方法包装在一个类中,该类仅将术语提取为字母数字字符序列(根据我对表达式语法的了解,就足够了):

    A complete example, packaging up the above approach in a class, which simply extract terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):

    import re
    import sqlite3
    import sys
    
    class FTS4SpellfixSearch(object):
        def __init__(self, conn, spellfix1_path):
            self.conn = conn
            self.conn.enable_load_extension(True)
            self.conn.load_extension(spellfix1_path)
    
        def create_schema(self):
            self.conn.executescript(
                """
                CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                    USING fts4(description text);
                CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                    USING fts4aux(fts4data);
                CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                    USING spellfix1;
                """
            )
    
        def index_text(self, *text):
            cursor = self.conn.cursor()
            with self.conn:
                params = ((t,) for t in text)
                cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
                cursor.execute(
                    """
                    INSERT INTO spellfix1data(word)
                    SELECT term FROM fts4data_terms
                    WHERE col='*' AND
                        term not in (SELECT word from spellfix1data_vocab)
                    """
                )
    
        # fts3 / 4 search expression tokenizer
        # no attempt is made to validate the expression, only
        # to identify valid search terms and extract them.
        # the fts3/4 tokenizer considers any alphanumeric ASCII character
        # and character in the range U+0080 and over to be terms.
        if sys.maxunicode == 0xFFFF:
            # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
            _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
        else:
            # UCS4
            _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")
    
        def _terms_from_query(self, search_query):
            """Extract search terms from a fts3/4 query
    
            Returns a list of terms and a template such that
            template.format(*terms) reconstructs the original query.
    
            terms using partial* syntax are ignored, as you can't distinguish
            between a misspelled prefix search that happens to match existing
            tokens and a valid spelling that happens to have 'near' tokens in
            the spellfix1 database that would not otherwise be matched by fts4
    
            """
            template, terms, lastpos = [], [], 0
            for match in self._fts4_expr_terms.finditer(search_query):
                token, (start, end) = match.group(), match.span()
                # skip columnname: and partial* terms by checking next character
                ismeta = search_query[end:end + 1] in {":", "*"}
                # skip digits if preceded by "NEAR/"
                ismeta = ismeta or (
                    token.isdigit() and template and template[-1] == "NEAR"
                    and "/" in search_query[lastpos:start])
                if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                    # full search term, not a keyword, column name or partial*
                    terms.append(token)
                    token = "{}"
                template += search_query[lastpos:start], token
                lastpos = end
            template.append(search_query[lastpos:])
            return terms, "".join(template)
    
        def spellcheck_terms(self, search_query):
            cursor = self.conn.cursor()
            base_spellfix = """
                SELECT :term{0} as term, word FROM spellfix1data
                WHERE word MATCH :term{0} and top=1
            """
            terms, template = self._terms_from_query(search_query)
            params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
            query = " UNION ".join(
                [base_spellfix.format(i + 1) for i in range(len(params))]
            )
            cursor.execute(query, params)
            correction_map = dict(cursor)
            return template.format(*(correction_map.get(t, t) for t in terms))
    
        def search(self, search_query):
            corrected_query = self.spellcheck_terms(search_query)
            cursor = self.conn.cursor()
            fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
            cursor.execute(fts_query, (corrected_query,))
            return {
                "terms": search_query,
                "corrected": corrected_query,
                "results": cursor.fetchall(),
            }
    

    和使用该类的交互式演示:

    and an interactive demo using the class:

    >>> db = sqlite3.connect(":memory:")
    >>> fts = FTS4SpellfixSearch(db, './spellfix')
    >>> fts.create_schema()
    >>> fts.index_text("All the Carmichael numbers")  # your example
    >>> from pprint import pprint
    >>> pprint(fts.search('NUMMBER carmickaeel'))
    {'corrected': 'numbers carmichael',
     'results': [('All the Carmichael numbers',)],
     'terms': 'NUMMBER carmickaeel'}
    >>> fts.index_text(
    ...     "They are great",
    ...     "Here some other numbers",
    ... )
    >>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
    {'corrected': 'here some',
     'results': [('Here some other numbers',)],
     'terms': 'here some'}
    >>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax 
    {'corrected': 'numbers NOT carmichael',
     'results': [('Here some other numbers',)],
     'terms': 'NUMMBER NOT carmickaeel'}
    

    这篇关于具有真正“全文搜索"功能的Sqlite;和拼写错误(一起使用FTS + spellfix)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆