Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)
Question
Let's say we have 1 million rows like this:
import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')
Background:

I know how to do this with Sqlite:
- Find a row with a single-word query, up to a few spelling mistakes, with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):
db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()
#Query: 'Riehmand'
#Answer: [(1, u'Riemann')]
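For intuition about what editdist3 computes: it is a weighted edit distance. A plain, unweighted Levenshtein distance (a sketch only, not spellfix's exact cost model) looks like this:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance:
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Riehmand", "Riemann"))  # 2: drop the 'h', turn 'd' into 'n'
```

editdist3 assigns configurable per-character costs on the order of 100 per edit rather than 1, which is why the query above compares against 300 rather than a small integer.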
With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization for this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:
c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()
#Query: 'Riehmand'
#Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!
- Find an expression with a query matching one or multiple words with FTS ("Full Text Search"):
c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()
#Query: 'NUMBERS carmichael'
#Answer: [(2, u'All the Carmichael numbers')]
It is case insensitive and you can even use a query with the two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each query keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.
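This drawback is easy to reproduce; a minimal sketch, assuming your sqlite3 build ships with FTS4 enabled (standard CPython builds do):

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE VIRTUAL TABLE demo USING fts4(description text)')
c.execute("INSERT INTO demo VALUES ('All the Carmichael numbers')")

# case and word order do not matter to FTS
hits = c.execute(
    'SELECT * FROM demo WHERE description MATCH "NUMBERS carmichael"'
).fetchall()
print(hits)    # [('All the Carmichael numbers',)]

# but a single misspelled keyword returns nothing
misses = c.execute(
    'SELECT * FROM demo WHERE description MATCH "carmickaeel"'
).fetchall()
print(misses)  # []
```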
How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes, i.e. "FTS + spellfix" together?
Example:

- row in the DB: "All the Carmichael numbers"
- query: "NUMMBER carmickaeel"

should match it!
How to do that with Sqlite? This page states:
Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.
Linked question: String similarity distance / edit distance (Levenshtein) with Python + Sqlite
Answer
The spellfix1 documentation actually tells you how to do this. From the Overview section:
If you intend to use this virtual table together with an FTS4 table (for spelling correction of search terms), you can extract all the indexed tokens with a SELECT term FROM search_aux WHERE col='*' statement.

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table with:
CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';
You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:
INSERT INTO mytable3(word)
SELECT term FROM mytable2_terms
WHERE col='*' AND term NOT IN (SELECT word FROM mytable3_vocab);
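To see what the fts4aux side of this produces on its own, here is a sketch that skips spellfix1 entirely (so it runs without the loadable extension); the table names follow the examples above:

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(description text)')
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")
c.execute('CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)')

# col='*' rows aggregate each term's statistics across all columns;
# the default "simple" tokenizer lowercases everything, and fts4aux
# yields terms in sorted order
vocab = [row[0] for row in c.execute(
    "SELECT term FROM mytable2_terms WHERE col='*'")]
print(vocab)  # ['all', 'carmichael', 'numbers', 'the']
```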
Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.

Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:
def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    # look up each term in the spellfix1 table created earlier (mytable3)
    base_spellfix = """
        SELECT :term{0} as term, word FROM mytable3
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join(
        [base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])


def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()
spellchecked_search(db, "NUMMBER carmickaeel") then returns [('All the Carmichael numbers',)].

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.
A complete example, packaging up the above approach in a class, which simply extracts terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):
import re
import sqlite3
import sys


class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3/4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }
and an interactive demo using the class:
>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}