如何正确实现在sqlite3的可变LIKE语句中使用的索引? [英] How to properly implement indexing for use in variable LIKE statement with sqlite3?
问题描述
我正在尝试在两个表之间进行一些模糊匹配. 一个是我在本地存储的表(9,000行),称为表A.另一个存储为sqlite db(200万+行csv),称为表B. 基本上,我想将表A中的"CompanyNames"列与表B中的"CurrentEntityNames"列进行匹配,并使用它来将表B联接到表A.
I am trying to do some fuzzy matching between two tables. One is a table I have stored locally (9,000 rows), call it table A. The other is stored as a sqlite db (2 million + rows csv), call it table B. Basically, I want to match the column "CompanyNames" from table A with the column "CurrentEntityNames" from table B and use this to left join table B to table A.
我目前能够遍历LIKE语句,并传递如下所示的参数: (myNames只是表A中的CompanyNames列作为列表).
I am currently able to loop through the LIKE statements, passing a parameter like so: (myNames is just the column CompanyNames from table A as a list).
for index, name in enumerate(myNames):
sql = 'SELECT * from "table" WHERE CurrentEntityName LIKE ?;'
param =(name + '%%',)
df = pd.read_sql_query(sql,engine, params=param)
myresponses[index] = df
但是,我有两个问题: 1.我意识到也许查询表A中的每一行不是很有效,因为目标是最大程度地减少与数据库的交互. 如果最好重组结构以减少查询,我该怎么做? 2.添加基于CurrentEntityName的索引会更快吗?
However, I have two questions: 1. I realize maybe querying for each row in table A is not very efficient as the goal is to minimize interaction with the db. In the case that it's better to restructure to have less queries, how would I do that? 2. Would adding an index based on CurrentEntityName make this faster?
对于方法2,我尝试使用(在另一个stackoverflow答案中找到)添加索引
For approach 2., I tried to add the index using (found in another stackoverflow answer)
meta = sqlalchemy.MetaData()
meta.reflect(bind=engine)
table = meta.tables['table']
my_index = sqlalchemy.Index('nameIds', table.columns.get('CurrentEntityName'))
但是我不确定在查询时如何实现这一点.
but I'm not sure how to implement this when querying.
对于方法1,我已经看到了一些使用conn和cursor的示例,但是实际上我不确定如何将它们与从引擎创建的数据库一起使用. (我使用
For approach 1., I've seen some examples using conn and cursor but actually I'm not sure how to use these in conjunction with a database created from the engine. (I loaded my data using
for df in pd.read_csv("C://Users//SEAB//Downloads//Active_Corporations___Beginning_1800.csv", chunksize = chunksize, iterator = True):
df = df.rename(columns={c:c.replace(' ', '') for c in df.columns})
df.index +=j
i+=1
df.to_sql('table', engine, if_exists = 'append')
j= df.index[-1] + 1
在本教程中找到的
[ https://plot.ly/python/v3/big-data-analytics-with-pandas-and-sqlite/]
found in this tutorial [https://plot.ly/python/v3/big-data-analytics-with-pandas-and-sqlite/]
基本上,查询仍然非常慢(9000行可能花费超过1个小时). 我真的很感谢任何建议或帮助.我是sqlite3的新手,所以有很多我不知道的地方.谢谢您的理解.
Basically, the query is still really slow (taking maybe more than 1 hour for 9000 rows). I really appreciate any advice or help. I'm new to sqlite3 so there's a lot I don't know. Thank you for your understanding.
推荐答案
规则关于Sqlite何时可以使用带有LIKE
的索引的很多方法,但是它可以做到.
The rules for when Sqlite can use a index with LIKE
are many, but it can do it.
本质上,考虑到默认的不区分大小写的行为:您需要具有 TEXT
亲和力的列在左侧.右侧必须是格式为'XXX%'的字符串文字(或者,如果语句是用sqlite3_prepare_v2()
编译的,则是绑定到字符串的参数),即任何通配符前的前导常数值.鉴于此,如果左侧列上有不区分大小写的索引,它可以重写查询以使用该索引,而不用查看每一行.
Essentially, given the default case-insensitive behavior: You need a column with TEXT
affinity on the left hand side. The right hand side needs to be a string literal (or, if the statement is compiled with sqlite3_prepare_v2()
, a parameter bound to a string) of the format 'XXX%' - that is, a leading constant value before any wildcards. Given that, if there's a case-insensitive index on the left hand column, it can rewrite the query to use that index instead of looking at every single row.
交互式会话中的一些示例:
Some examples from an interactive session:
sqlite> CREATE TABLE ex(col1 TEXT, col2 TEXT COLLATE NOCASE);
sqlite> CREATE INDEX ex_col1_idx ON ex(col1);
sqlite> CREATE INDEX ex_col2_idx ON ex(col2);
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE 'foo%';
QUERY PLAN
`--SCAN TABLE ex
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col2 LIKE 'foo%';
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col2_idx (col2>? AND col2<?)
sqlite> CREATE INDEX ex_col1_idx_nocase ON ex(col1 COLLATE NOCASE);
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE 'foo%';
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col1_idx_nocase (col1>? AND col1<?)
sqlite> .parameter init
sqlite> .parameter set ?1 'foo%'
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE ?;
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col1_idx_nocase (col1>? AND col1<?)
如您所见,要搜索的索引列需要在表定义中显式地指定不区分大小写的排序规则,或者具有显式不区分大小写的索引.
As you can see, the indexed column being searched needs to explicitly be given a case-insensitive collation in the table definition, or have an explicitly case-insensitive index.
在您遇到的情况中,最有可能发生问题的就是Python sqlite绑定如何准备与execute
方法一起使用的语句-它使用旧的sqlite3_prepare()
还是较新的sqlite3_prepare_v2()
API?如果我正在查看正确的源文件它使用了后者,因此这不应该成为问题.
The big potential for things going bad in your case is how the Python sqlite bindings prepare the statements used with execute
methods - does it use the old sqlite3_prepare()
or the newer sqlite3_prepare_v2()
API? If I'm looking at the right source file it uses the latter, so that shouldn't be an issue.
这篇关于如何正确实现在sqlite3的可变LIKE语句中使用的索引?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!