字符串与 Python + Sqlite 的相似度(Levenshtein 距离/编辑距离) [英] String similarity with Python + Sqlite (Levenshtein distance / edit distance)

查看:54
本文介绍了字符串与 Python + Sqlite 的相似度(Levenshtein 距离/编辑距离)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Python+Sqlite 中是否有可用的字符串相似性度量,例如 sqlite3 模块?

用例示例:

导入 sqlite3conn = sqlite3.connect(':memory:')c = conn.cursor()c.execute('CREATE TABLE mytable (id integer, description text)')c.execute('INSERT INTO mytable VALUES (1, "hello world, Guys")')c.execute('INSERT INTO mytable VALUES (2, 大家好")')

此查询应匹配 ID 为 1 的行,而不匹配 ID 为 2 的行:

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wold gyus") < 6')

如何在 Sqlite+Python 中做到这一点?

关于我目前发现的内容的说明:

  • Levenshtein distance,即单字符编辑的最小数量(将一个单词更改为另一个单词所需的插入、删除或替换)可能很有用,但我不确定 Sqlite 中是否存在官方实现(我见过一些自定义实现,例如 这个)

  • Damerau-Levenshtein 是一样的,除了它还允许两个相邻字符之间的换位;它也被称为编辑距离

  • 我知道可以定义一个函数 我自己,但实现这样的距离将是不平凡的(对数据库进行超高效的自然语言处理比较真的很重要),这就是为什么我想看看 Python/Sqlite 是否已经具有这样的工具p>

  • Sqlite 具有 FTS(全文搜索)功能:FTS3FTS4, FTS5

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);/* FTS3 表 */创建表 enrondata2(内容文本);/* 普通表 */SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';/* 0.03 秒 */SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%';/* 22.5 秒 */

    但我没有发现关于字符串比较与这样的相似距离",FTS 的特征 MATCHNEAR 似乎没有字母变化的相似性度量等

  • 此外这个答案表明:

    <块引用>

    SQLite 的 FTS 引擎基于令牌 - 搜索引擎试图匹配的关键字.
    有多种标记器可用,但它们相对简单.简单"标记器简单地拆分每个单词并将其小写:例如,在字符串The quick brown fox jumps over the lazy dog"中,单词jumps"会匹配,但不会匹配jump"."porter" 分词器更高级一些,去除了单词的共轭关系,因此 "jumps" 和 "jumping" 会匹配,但是像 "jmups" 这样的拼写错误不会.

    遗憾的是,后者(无法发现jmups"与jumps"相似)使其在我的用例中不切实际.

解决方案

这是一个现成的例子 test.py:

导入 sqlite3db = sqlite3.connect(':memory:')db.enable_load_extension(真)db.load_extension('./spellfix') # Linux#db.load_extension('./spellfix.dll') # <-- 在 WINDOWS 中取消注释db.enable_load_extension(假)c = db.cursor()c.execute('CREATE TABLE mytable (id integer, description text)')c.execute('INSERT INTO mytable VALUES (1, "hello world, Guys")')c.execute('INSERT INTO mytable VALUES (2, 大家好")')c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wold guy") < 600')打印 c.fetchall()# 输出:[(1, 你好世界,伙计们')]

重要说明:距离 editdist3 已标准化,以便

<块引用>

值100用于插入和删除,150用于替换

<小时>

以下是在 Windows 上首先要做的事情:

  1. 下载https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip 并解压它们

  2. C:Python27DLLssqlite3.dll 替换为新的 sqlite3.dll 来自这里.如果跳过这个,你会得到一个 sqlite3.OperationalError: The specified procedure could not be found later

  3. 运行:

    调用C:Program Files (x86)Microsoft Visual Studio 12.0VCvcvarsall.bat"

    调用C:Program Files (x86)Microsoft Visual Studio 12.0VCvcvarsall.bat"x64cl/I sqlite-amalgamation-3110100/sqlite-src-3110100/ext/misc/spellfix.c/link/DLL/OUT:spellfix.dll蟒蛇测试.py

    (使用 MinGW,它将是:gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/-o spellfix.dll)

这是在 Linux Debian 上的操作方法:

(基于这个答案)

apt-get -y install unzip build-essential libsqlite3-devwget https://sqlite.org/2016/sqlite-src-3110100.zip解压 sqlite-src-3110100.zipgcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so蟒蛇测试.py

以下是在使用较旧 Python 版本的 Linux Debian 上执行此操作的方法:

如果您的发行版的 Python 有点旧,它将需要另一种方法.由于 sqlite3 模块是 Python 内置的,看起来 升级它并不简单(pip install --upgrade pysqlite 只会升级 pysqlite 模块,而不是底层的 SQLite 库).因此此方法有效,例如如果import sqlite3;打印 sqlite3.sqlite_version 是 3.8.2:

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gztar xvfz SQLite-27392118.tar.gzcd SQLite-27392118;sh 配置;制作 sqlite3.c ;光盘..gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/-o spellfix.sopython test.py # [(1, 你好世界,伙计们')]

Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

Example of use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

This query should match the row with ID 1, but not the row with ID 2:

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

How to do this in Sqlite+Python?

Notes about what I've found so far:

  • The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure if an official implementation exists in Sqlite (I've seen a few custom implementations, like this one)

  • The Damerau-Levenshtein is the same, except it also allows transposition between 2 adjacent characters; it is also called the Edit distance

  • I know it's possible to define a function myself, but implementing such a distance will be non-trivial (doing natural language processing comparison super efficiently for databases is really non-trivial), that's why I wanted to see if Python / Sqlite already features such a tool

  • Sqlite has FTS (Full Text Seach) features: FTS3, FTS4, FTS5

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
    CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
    SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
    SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
    

    but I don't find about string comparison with such a "similarity distance", FTS's features MATCH or NEAR don't seem to have similarity measure with letters changes, etc.

  • Moreover this answer shows that:

    SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
    A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

    The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

解决方案

Here is a ready-to-use example test.py:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
print c.fetchall()
# Output: [(1, u'hello world, guys')]

Important note: The distance editdist3 is normalized so that

the value of 100 is used for insertion and deletion and 150 is used for substitution


Here is what to do first on Windows:

  1. Download https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip and unzip them

  2. Replace C:Python27DLLssqlite3.dll by the new sqlite3.dll from here. If skipping this, you'd get a sqlite3.OperationalError: The specified procedure could not be found later

  3. Run:

    call "C:Program Files (x86)Microsoft Visual Studio 12.0VCvcvarsall.bat"  
    

    or

    call "C:Program Files (x86)Microsoft Visual Studio 12.0VCvcvarsall.bat" x64
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py
    

    (With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/ -o spellfix.dll)

Here is how to do it on Linux Debian:

(based on this answer)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

Here is how to do it on Linux Debian with an older Python version:

If your distribution's Python is a bit old, it will require another method. As sqlite3 module is built-in in Python, it seems not straightforward to upgrade it (pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library). Thus this method works for example if import sqlite3; print sqlite3.sqlite_version is 3.8.2:

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]

这篇关于字符串与 Python + Sqlite 的相似度(Levenshtein 距离/编辑距离)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆