具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix) [英] Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

查看：116 发布时间：2020/11/11 1:47:53 python sqlite full-text-search levenshtein-distance

本文介绍了具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix)的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我们有100万行，例如:

Let's say we have 1 million of rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

背景:

我知道如何使用Sqlite:

Background:

I know how to do this with Sqlite:

使用单字查询查找行，使用，有关如何编译，如何使用它，...):

Find a row with a single-word query, up to a few spelling mistakes with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):

db.enable_load_extension(True) db.load_extension('./spellfix') c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall() #Query: 'Riehmand' #Answer: [(1, u'Riemann')]

具有1M行，这将非常慢！如此处详细所示， postgresql可能使用trigrams对此进行了优化. Sqlite提供的一种快速解决方案是使用VIRTUAL TABLE USING spellfix:

With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization with this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:

c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1') c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")') c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall() #Query: 'Riehmand' #Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!

使用FTS(全文搜索")通过查询匹配一个或多个单词 的表达式:

Find an expression with a query matching one or multiple words with FTS ("Full Text Search"):

c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)') c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")') c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall() #Query: 'NUMBERS carmichael' #Answer: [(2, u'All the Carmichael numbers')]

它不区分大小写，您甚至可以使用带有错误顺序的两个单词的查询，等等:FTS确实非常强大.但是缺点是每个查询关键字必须正确拼写，即仅FTS不允许出现拼写错误.

It is case insensitive and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each of the query-keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.

如何使用Sqlite 进行全文搜索(FTS)并允许拼写错误??即同时使用"FTS + spellfix"

How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together

示例:

在数据库中的行:"All the Carmichael numbers"

查询:"NUMMBER carmickaeel"应该匹配它！

row in the DB: "All the Carmichael numbers"

query: "NUMMBER carmickaeel" should match it!

如何使用Sqlite做到这一点?

由于此页面指出:

或者，它[spellfix]可以与FTS4一起使用，以使用可能拼写错误的单词进行全文搜索.

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

链接的问题:与Python + Sqlite(Levenshtein)的字符串相似性距离/编辑距离)

推荐答案

spellfix1文档实际上告诉您如何执行此操作.从概述部分:

The spellfix1 documentation actually tells you how to do this. From the Overview section:

如果您打算将此虚拟表与FTS4表配合使用(用于搜索词的拼写更正)，则可以使用

SELECT term from search_aux WHERE col='*'语句提取所有索引的令牌.

将其与示例相连，其中mytable2是fts4虚拟表，您可以创建一个fts4aux表并将这些标记插入到mytable3 spellfix1表中，

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create a fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2); INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

您可能希望进一步限定该查询，以跳过已经在spellfix1中插入的所有字词，否则最终将出现两次输入:

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:

INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*' AND term not in (SELECT word from mytable3_vocab);

现在，您可以使用mytable3将拼写错误的单词映射到更正的标记，然后在针对mytable2的MATCH查询中使用这些更正的标记.

Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query againsts mytable2.

根据您的需要，这可能意味着您需要自己进行令牌处理和查询构建；没有公开的fts4查询语法解析器.因此，您需要拆分两个令牌的搜索字符串，每个令牌都通过spellfix1表运行以映射到现有令牌，然后将这些令牌提供给fts4查询.

Depending on your neads, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.

忽略SQL语法来处理此问题，使用Python进行拆分很容易:

Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:

def spellcheck_terms(conn, terms): cursor = conn.cursor() base_spellfix = """ SELECT :term{0} as term, word FROM spellfix1data WHERE word MATCH :term{0} and top=1 """ terms = terms.split() params = {"term{}".format(i): t for i, t in enumerate(terms, 1)} query = " UNION ".join([ base_spellfix.format(i + 1) for i in range(len(params))]) cursor.execute(query, params) correction_map = dict(cursor) return " ".join([correction_map.get(t, t) for t in terms]) def spellchecked_search(conn, terms): corrected_terms = spellcheck_terms(conn, terms) cursor = conn.cursor() fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?' cursor.execute(fts_query, (corrected_terms,)) return cursor.fetchall()

然后为spellchecked_search(db, "NUMMBER carmickaeel")返回[('All the Carmichael numbers',)].

在Python中保留拼写检查处理，然后可以根据需要支持更复杂的FTS查询；您可能必须重新实现表达式解析器这样做，但是至少Python提供了执行此操作的工具.

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

一个完整的示例，将上述方法包装在一个类中，该类仅将术语提取为字母数字字符序列(根据我对表达式语法的了解，就足够了):

A complete example, packaging up the above approach in a class, which simply extract terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):

import re import sqlite3 import sys class FTS4SpellfixSearch(object): def __init__(self, conn, spellfix1_path): self.conn = conn self.conn.enable_load_extension(True) self.conn.load_extension(spellfix1_path) def create_schema(self): self.conn.executescript( """ CREATE VIRTUAL TABLE IF NOT EXISTS fts4data USING fts4(description text); CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms USING fts4aux(fts4data); CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data USING spellfix1; """ ) def index_text(self, *text): cursor = self.conn.cursor() with self.conn: params = ((t,) for t in text) cursor.executemany("INSERT INTO fts4data VALUES (?)", params) cursor.execute( """ INSERT INTO spellfix1data(word) SELECT term FROM fts4data_terms WHERE col='*' AND term not in (SELECT word from spellfix1data_vocab) """ ) # fts3 / 4 search expression tokenizer # no attempt is made to validate the expression, only # to identify valid search terms and extract them. # the fts3/4 tokenizer considers any alphanumeric ASCII character # and character in the range U+0080 and over to be terms. if sys.maxunicode == 0xFFFF: # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+") else: # UCS4 _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+") def _terms_from_query(self, search_query): """Extract search terms from a fts3/4 query Returns a list of terms and a template such that template.format(*terms) reconstructs the original query. terms using partial* syntax are ignored, as you can't distinguish between a misspelled prefix search that happens to match existing tokens and a valid spelling that happens to have 'near' tokens in the spellfix1 database that would not otherwise be matched by fts4 """ template, terms, lastpos = [], [], 0 for match in self._fts4_expr_terms.finditer(search_query): token, (start, end) = match.group(), match.span() # skip columnname: and partial* terms by checking next character ismeta = search_query[end:end + 1] in {":", "*"} # skip digits if preceded by "NEAR/" ismeta = ismeta or ( token.isdigit() and template and template[-1] == "NEAR" and "/" in search_query[lastpos:start]) if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta: # full search term, not a keyword, column name or partial* terms.append(token) token = "{}" template += search_query[lastpos:start], token lastpos = end template.append(search_query[lastpos:]) return terms, "".join(template) def spellcheck_terms(self, search_query): cursor = self.conn.cursor() base_spellfix = """ SELECT :term{0} as term, word FROM spellfix1data WHERE word MATCH :term{0} and top=1 """ terms, template = self._terms_from_query(search_query) params = {"term{}".format(i): t for i, t in enumerate(terms, 1)} query = " UNION ".join( [base_spellfix.format(i + 1) for i in range(len(params))] ) cursor.execute(query, params) correction_map = dict(cursor) return template.format(*(correction_map.get(t, t) for t in terms)) def search(self, search_query): corrected_query = self.spellcheck_terms(search_query) cursor = self.conn.cursor() fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?" cursor.execute(fts_query, (corrected_query,)) return { "terms": search_query, "corrected": corrected_query, "results": cursor.fetchall(), }

和使用该类的交互式演示:

and an interactive demo using the class:

>>> db = sqlite3.connect(":memory:") >>> fts = FTS4SpellfixSearch(db, './spellfix') >>> fts.create_schema() >>> fts.index_text("All the Carmichael numbers") # your example >>> from pprint import pprint >>> pprint(fts.search('NUMMBER carmickaeel')) {'corrected': 'numbers carmichael', 'results': [('All the Carmichael numbers',)], 'terms': 'NUMMBER carmickaeel'} >>> fts.index_text( ... "They are great", ... "Here some other numbers", ... ) >>> pprint(fts.search('here some')) # edgecase, multiple spellfix matches {'corrected': 'here some', 'results': [('Here some other numbers',)], 'terms': 'here some'} >>> pprint(fts.search('NUMMBER NOT carmickaeel')) # using fts4 query syntax {'corrected': 'numbers NOT carmichael', 'results': [('Here some other numbers',)], 'terms': 'NUMMBER NOT carmickaeel'}

这篇关于具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix)的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix) [英] Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

问题描述

背景:

Background:

推荐答案

相关文章

数据库最新文章

热门教程

热门工具

登录关闭

具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix) [英] Sqlite with real &quot;Full Text Search&quot; and spelling mistakes (FTS+spellfix together)

问题描述

背景:

Background:

推荐答案

相关文章

数据库最新文章

热门教程

热门工具

登录 关闭

具有真正“全文搜索"功能的Sqlite；和拼写错误(一起使用FTS + spellfix) [英] Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

登录关闭