Fast query in formatted data


Problem description

In my program I need to query through metadata.

I read the data into a numpy record array A from a csv-like text file **without duplicate rows**.

var1|var2|var3|var4|var5|var6
'a1'|'b1'|'c1'|1.2|2.2|3.4
'a1'|'b1'|'c4'|3.2|6.2|3.2
'a2'|''|'c1'|1.4|5.7|3.8
'a2'|'b1'|'c2'|1.2|2.2|3.4
'a3'|''|'c2'|1.2|2.2|3.4
'a1'|'b2'|'c4'|7.2|6.2|3.2
...

There are millions of rows, and the query in nested loops can run up to a billion times (mostly matching the first 3 columns), so efficiency becomes critical.


  • Get rows matching one or more of the first 3 columns with given strings, e.g.,


  • Match records where var1='a2' and var2='b1':

 ind = np.logical_and(A['var1']=='a2', A['var2']=='b1')


  • Match records where var1='a2', var2='b1' and var3='c1':

     ind = np.logical_and(np.logical_and(A['var1']=='a2', A['var2']=='b1'), A['var3']=='c1')
    


    As one can see, each time we compare all elements of the columns with the given strings.

    I thought mapping could be a more efficient way of indexing, so I converted the recarray A to a dict D = {'var1_var2_var3': [var4, var5, var6], ...}, and searched through the keys by fnmatch(keys, pat). I'm not sure it's a better way.

    Or I can make a hierarchical dict {'var1':{'var2':{'var3':[],...},...},...} or an in-memory hdf5 /var1/var2/var3 and just try to get the item if it exists. Would this be the fastest way?
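    As a rough sketch of that idea (the sample array and variable names here are illustrative, not from the original post), a flat dict keyed by the label tuple gives O(1) exact-match lookups for the common first-three-columns query:

```python
import numpy as np

# Hypothetical sketch: a small structured array shaped like the question's data.
A = np.array(
    [('a1', 'b1', 'c1', 1.2, 2.2, 3.4),
     ('a1', 'b1', 'c4', 3.2, 6.2, 3.2),
     ('a2', '',   'c1', 1.4, 5.7, 3.8)],
    dtype=[('var1', 'U3'), ('var2', 'U3'), ('var3', 'U3'),
           ('var4', 'f8'), ('var5', 'f8'), ('var6', 'f8')])

# Map (var1, var2, var3) -> row index; rows are unique, so one index per key.
index = {(r['var1'], r['var2'], r['var3']): i for i, r in enumerate(A)}

# O(1) hash lookup instead of scanning every row.
row = A[index[('a1', 'b1', 'c4')]]
```

    Building the dict is a one-time O(n) cost, after which each exact-match query avoids the full-column comparisons entirely.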

    The latter two types of queries are not very frequent, and for them I can accept the numpy recarray comparison approach.


    • Get all rows where the numeric values in the latter columns are in a specific range, e.g.,


    • Get rows where 1<var4<3 and 0<var5<3:

    ind = np.logical_and(np.logical_and(1<A['var4'], A['var4']<3), np.logical_and(0<A['var5'], A['var5']<3))
    


    • A combination of the above two, e.g.,


    • Get rows where var2='b1', 1<var4<3 and 0<var5<3:

    ind = np.logical_and(np.logical_and(A['var2']=='b1', np.logical_and(1<A['var4'], A['var4']<3)), np.logical_and(0<A['var5'], A['var5']<3))
    


    SQL could be a good way, but it looks too heavyweight to use a database for this small task. And I don't have the authority to install database support everywhere.

    Any suggestions for a data structure for fast in-memory queries? (If it is hard to have a simple custom implementation, sqlite and pandas.DataFrame seem to be possible solutions, as suggested.)

    Answer

    With your file sample ('b' for py3)

    In [51]: txt=b"""var1|var2|var3|var4|var5|var6
        ...: 'a1'|'b1'|'c1'|1.2|2.2|3.4
        ...: 'a1'|'b1'|'c4'|3.2|6.2|3.2
        ...: 'a2'|''|'c1'|1.4|5.7|3.8
        ...: 'a2'|'b1'|'c2'|1.2|2.2|3.4
        ...: 'a3'|''|'c2'|1.2|2.2|3.4
        ...: 'a1'|'b2'|'c4'|7.2|6.2|3.2"""
    

    A simple read leaves me with the double layer of quoting

    data = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype=None)
    
    array([(b"'a1'", b"'b1'", b"'c1'", 1.2, 2.2, 3.4), ...
        dtype=[('var1', 'S4'), ('var2', 'S4'), ('var3', 'S4'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])
    

    So I'll define a converter to strip those (a csv reader might do it as well):

    def foo(astr):
        # evaluate the quoted string to strip the embedded quotes: "'a1'" -> 'a1'
        return eval(astr)
    
    In [55]: A = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype='U3,U3,U3,f8,f8,f8', converters={0:foo,1:foo,2:foo})
    In [56]: A
    Out[56]: 
    array([('a1', 'b1', 'c1', 1.2, 2.2, 3.4),
           ('a1', 'b1', 'c4', 3.2, 6.2, 3.2), 
           ('a2', '', 'c1', 1.4, 5.7, 3.8),
           ('a2', 'b1', 'c2', 1.2, 2.2, 3.4), 
           ('a3', '', 'c2', 1.2, 2.2, 3.4),
           ('a1', 'b2', 'c4', 7.2, 6.2, 3.2)], 
          dtype=[('var1', '<U3'), ('var2', '<U3'), ('var3', '<U3'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])
    

    I can write tests like:

    In [57]: (A['var1']=='a2')&(A['var2']=='b1')
    Out[57]: array([False, False, False,  True, False, False], dtype=bool)
    In [58]: (1<A['var4'])&(A['var4']<3)
    Out[58]: array([ True, False,  True,  True,  True, False], dtype=bool)
    

    The tests over all records of A are done in compiled numpy code, so they shouldn't be that slow.

    This data could also be viewed as 2 multicolumn fields

    In [59]: dt = np.dtype([('labels', '<U3', (3,)), ('data', '<f8', (3,))])
    In [60]: A1 = A.view(dt)
    In [61]: A1
    Out[61]: 
    array([(['a1', 'b1', 'c1'], [1.2, 2.2, 3.4]),
           (['a1', 'b1', 'c4'], [3.2, 6.2, 3.2]),
           (['a2', '', 'c1'], [1.4, 5.7, 3.8]),
           (['a2', 'b1', 'c2'], [1.2, 2.2, 3.4]),
           (['a3', '', 'c2'], [1.2, 2.2, 3.4]),
           (['a1', 'b2', 'c4'], [7.2, 6.2, 3.2])], 
          dtype=[('labels', '<U3', (3,)), ('data', '<f8', (3,))])
    

    Or load it directly with:

    A = np.genfromtxt(txt.splitlines(), skip_header=1, delimiter='|', dtype='(3)U3,(3)f8', converters={0:foo,1:foo,2:foo})
    

    Then the tests can be written as:

    In [64]: (A1['labels'][:,0]=='a1') & (A1['labels'][:,1]=='b2') & ((A1['data']<6).any(axis=1))
    Out[64]: array([False, False, False, False, False,  True], dtype=bool)
    
    In [65]: (A1['labels'][:,[0,1]]==['a1','b2']).all(axis=1)
    Out[65]: array([False, False, False, False, False,  True], dtype=bool)
    

    Sometimes it might be clearer to give individual columns their own id:

    var1 = A1['labels'][:,0]   # or A['var1']
    ....
    (var1=='a1')&(var2=='b1')&...
    

    Repeated queries, or combinations could be saved.

    I believe pandas stores its series in numpy arrays, with a different dtype for each column (and object dtype if types vary within a column). But I haven't seen discussion of pandas speed and speed tricks. I don't expect much speed improvement unless it provides some sort of indexing.
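    As a hedged sketch of what that indexing could look like (assuming pandas is available; the data mirrors the sample above):

```python
import pandas as pd

# Hypothetical sketch: the sample data as a DataFrame.
df = pd.DataFrame({
    'var1': ['a1', 'a1', 'a2', 'a2', 'a3', 'a1'],
    'var2': ['b1', 'b1', '',   'b1', '',   'b2'],
    'var3': ['c1', 'c4', 'c1', 'c2', 'c2', 'c4'],
    'var4': [1.2, 3.2, 1.4, 1.2, 1.2, 7.2],
    'var5': [2.2, 6.2, 5.7, 2.2, 2.2, 6.2],
    'var6': [3.4, 3.2, 3.8, 3.4, 3.4, 3.2],
})

# Boolean masks read much like the numpy versions:
hit = df[(df['var1'] == 'a2') & (df['var2'] == 'b1')]

# A MultiIndex on the first three columns turns repeated exact-match
# queries into index lookups instead of full-column scans:
indexed = df.set_index(['var1', 'var2', 'var3']).sort_index()
row = indexed.loc[('a2', 'b1', 'c2')]
```

    Whether this beats plain numpy masks would need profiling on the real data; the win, if any, comes from the sorted index on repeated lookups.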

    I can imagine writing this data to a database. sqlite3 is builtin and has a memory mode so you don't need file access. But I'm sufficiently out of practice with that code that I'll pass on demonstrating it. Nor do I have a sense of how easy or fast it is to do these sorts of queries.
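    For completeness, a minimal sketch of that in-memory mode; the table name, schema, and index here are illustrative assumptions, not code from the answer:

```python
import sqlite3

# sqlite3's in-memory mode needs no file access.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE data (var1 TEXT, var2 TEXT, var3 TEXT, '
            'var4 REAL, var5 REAL, var6 REAL)')
rows = [('a1', 'b1', 'c1', 1.2, 2.2, 3.4),
        ('a2', 'b1', 'c2', 1.2, 2.2, 3.4)]
cur.executemany('INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)', rows)

# An index over the first three columns is what would make the
# frequent exact-match queries cheap:
cur.execute('CREATE INDEX idx_labels ON data (var1, var2, var3)')

cur.execute('SELECT var4 FROM data WHERE var1=? AND var2=? AND var3=?',
            ('a2', 'b1', 'c2'))
result = cur.fetchone()
```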

    https://mail.scipy.org/pipermail/scipy-user/2007-August/013350.html has some code that can save a structured array to a sqlite3 database. It includes a function that converts a dtype into a table creation statement.
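    The idea of that converter can be sketched roughly as follows; the kind-to-SQL-type mapping is my assumption, not the linked pipermail implementation:

```python
import numpy as np

# Rough sketch: map numpy dtype kind codes to SQL column types.
# This mapping is an illustrative assumption.
_SQL_TYPES = {'U': 'TEXT', 'S': 'TEXT', 'f': 'REAL', 'i': 'INTEGER'}

def dtype_to_create(table, dt):
    """Build a CREATE TABLE statement from a structured dtype."""
    cols = ', '.join('%s %s' % (name, _SQL_TYPES[dt[name].kind])
                     for name in dt.names)
    return 'CREATE TABLE %s (%s)' % (table, cols)

dt = np.dtype([('var1', 'U3'), ('var2', 'U3'), ('var3', 'U3'),
               ('var4', 'f8'), ('var5', 'f8'), ('var6', 'f8')])
stmt = dtype_to_create('data', dt)
```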

    ====================

    I've got that pipermail example working with python3. The test example has 11 fields. With 5000 records,

    data[np.where(data['id']=='id2000')]
    

    is 6x faster than a corresponding sqlite3 query (with an existing cursor):

    cursor.execute('select * from data where id=?',('id2000',))
    cursor.fetchone()
    
