Secondary in-memory index representations in Python


Question

I am looking for an efficient way to build a secondary in-memory index in Python using a high-level, optimised numerical package such as numpy or arrow. I am excluding pandas for performance reasons.

"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - J. V. D'Silva et al. (2017)

Let's take a simple example; we can scale it up later to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')
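In terms of the definition quoted earlier, a secondary index on these values maps each distinct value to the rows that hold it. Here is a minimal pure-Python sketch of that idea, not an optimised solution. The function name `build_secondary_index` is made up for illustration, and row positions (0-based) stand in for the record pointers.

```python
from collections import defaultdict

def build_secondary_index(values):
    """Map each distinct non-missing value to the list of row positions holding it."""
    index = defaultdict(list)
    for row, value in enumerate(values):
        if value is None:          # missing values get no index entry
            continue
        index[value].append(row)   # the row position plays the role of the pointer
    return dict(index)

column = [15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1]
index = build_secondary_index(column)
print(index[142.88])  # → [2, 3]
```

This is the shape of the result the rest of the post tries to reach with array operations instead of a Python loop.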

Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array into a dictionary-encoded representation that is close to a secondary index.

import pyarrow as pa
pa.array(val).dictionary_encode()  # val is a numpy array, so wrap it in an Arrow array first
Out[55]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]
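For comparison, a similar encoding can be sketched with NumPy alone using `np.unique(..., return_inverse=True)`. This is only an analogue, not what Arrow does internally: the dictionary comes out sorted rather than in order of first appearance, and in this sketch I drop the NaN entries first. The names `present`, `dictionary` and `indices` are my own.

```python
import numpy as np

val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')

# Keep only the non-missing values before encoding
present = ~np.isnan(val)

# dictionary: sorted distinct values; indices: position of each value in dictionary
dictionary, indices = np.unique(val[present], return_inverse=True)

# Decoding is a single fancy-indexing step
decoded = dictionary[indices]
print(indices)  # codes for 15.5, 3.75, 142.88, 142.88, 7.2, 2.1
```

The two rows holding 142.88 share the same code, which is the property that makes the encoding close to a secondary index.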

I have already [...] here.

So the question is how fast you can build a secondary index in memory using Python data structures that hold values and indices efficiently. But that is only half the story: the index is useful only if it serves well both filtering queries (point, range) and transformations, i.e. reconstruction of a row, a column, or an association (a.k.a. hyperedge in TRIADB). And even this quick description does not cover how easy it will be to update this kind of index.
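To make the point and range queries concrete, here is a rough sketch of how a sorted index could answer them with `np.searchsorted`. It assumes row positions serve as primary keys and that NaN marks missing data. The helper names `point_query` and `range_query` are hypothetical.

```python
import numpy as np

val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')

# Sort once; NaN sorts to the end, so the first n_valid slots are real values
order = np.argsort(val, kind='stable')
n_valid = np.count_nonzero(~np.isnan(val))
sorted_val = val[order][:n_valid]
sorted_pk = order[:n_valid]

def point_query(x):
    """Row positions whose value equals x."""
    x = sorted_val.dtype.type(x)  # compare in the index dtype (float32 here)
    lo = np.searchsorted(sorted_val, x, side='left')
    hi = np.searchsorted(sorted_val, x, side='right')
    return sorted_pk[lo:hi]

def range_query(low, high):
    """Row positions whose value lies in the closed interval [low, high]."""
    lo = np.searchsorted(sorted_val, sorted_val.dtype.type(low), side='left')
    hi = np.searchsorted(sorted_val, sorted_val.dtype.type(high), side='right')
    return sorted_pk[lo:hi]

print(point_query(142.88))   # → [2 3]
print(range_query(3, 20))    # → [1 7 0]
```

Both queries cost two binary searches plus a slice, so no scan of the column is needed once the index is built.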

For many reasons, I have started investigating a possible PyArrow open-source solution. A sorted dictionary-encoded representation should generally meet the requirements of the problem, with an excellent combination of a smaller memory footprint and faster, more flexible zero-copy I/O processing.

Recommended answer

Solution

I have searched, both in the past and recently, for an open-source solution to this problem, but I have not found one that satisfies my appetite. This time I decided to start building my own and to discuss its implementation openly; it also covers the null case, i.e. the missing-data scenario.

Do notice that a secondary index is very close to an adjacency list representation, a core element of my TRIADB project, and that is the main reason behind searching for a solution.

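Let's start with numpy, sorting a structured array of (pk, val) pairs by value. Note that the outputs shown here have 0-based keys, i.e. pk = 0..8.

# struct_type is not defined in the post; this definition is an assumption that matches zip(pk, val)
struct_type = np.dtype([('pk', 'uint32'), ('val', 'float32')])
idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')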

idx['val']
Out[68]: 
array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,
          nan], dtype=float32)

idx['pk']
Out[69]: array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)

Faster solution (less generic)

This is the special but perfectly valid case where pk takes the values in range(n).

idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]
idx_val
Out[93]: array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)
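If pk does not coincide with range(n), the same argsort can still be reused by indexing pk with the sort order. This is a small sketch under that assumption. The key values below are hypothetical and only chosen so that they are visibly different from row positions.

```python
import numpy as np

# Hypothetical primary keys that are not simply 0..n-1
pk = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')

# argsort returns row positions; a stable sort keeps equal values in row order
order = np.argsort(val, kind='stable')

idx_val = val[order]   # values in ascending order, NaN last
idx_pk = pk[order]     # primary keys carried along in the same order
print(idx_pk)          # → [90 20 80 10 30 40 50 60 70]
```

This avoids building the structured array, at the cost of one extra gather through `pk`.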

There are a few more steps to get a secondary index representation according to the definition of J. V. D'Silva et al.:

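  1. Get rid of nan
  2. Calculate the unique values of the secondary index
  3. For each unique value, calculate the list of primary key indices of all rows of the table that contain that value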

Unique secondary index with adjacency lists

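def secondary_index_with_adjacency_list(arr):
    # Row positions that sort arr in ascending order; NaN values sort to the end
    idx_pk = np.argsort(arr)
    idx_val = arr[idx_pk]
    # Number of non-NaN entries; they occupy the first cnt sorted slots
    cnt = np.count_nonzero(~np.isnan(idx_val))
    # Unique values, the offset where each run of equal values starts, and each run's length
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    # Split the sorted row positions at the run starts; the first piece is empty, so drop it
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]

    return usec_ndx, cnt_arr, adj_list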

ndx, freq, adj = secondary_index_with_adjacency_list(val)

import pandas as pd  # used here only to display the result

pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})

Out[11]: 
      val  freq     adj
0    2.10     1     [8]
1    3.75     1     [1]
2    7.20     1     [7]
3   15.50     1     [0]
4  142.88     2  [2, 3]
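Once the index is in this form, a point lookup is a binary search over the unique values followed by picking the matching adjacency list. Here is a minimal sketch. The `lookup` helper is hypothetical, and I pass `kind='stable'` to argsort so that rows with equal values keep their original order.

```python
import numpy as np

def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr, kind='stable')
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))
    keys, split_ndx, counts = np.unique(idx_val[:cnt], return_index=True,
                                        return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]
    return keys, counts, adj_list

def lookup(keys, adj_list, x):
    """Return the row positions stored for value x, or an empty array."""
    x = keys.dtype.type(x)               # compare in the index dtype
    i = int(np.searchsorted(keys, x))    # keys are sorted and unique
    if i < len(keys) and keys[i] == x:
        return adj_list[i]
    return np.empty(0, dtype=np.intp)

val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')
ndx, freq, adj = secondary_index_with_adjacency_list(val)

print(lookup(ndx, adj, 142.88))  # → [2 3]
print(lookup(ndx, adj, 1.0))     # → []
```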

Discussion

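In practice it is faster to use the secondary index representation with repeated values than the one with lists of pointers to the table records, but the second has the interesting property that it is closer to the hypergraph representation I use in TRIADB.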
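The two representations carry the same information, so one can be derived from the other. A small sketch of the conversion in both directions, using the index values from the example above, follows. The variable names are mine.

```python
import numpy as np

# Adjacency-list form: unique keys, their frequencies, and one row list per key
keys = np.array([2.1, 3.75, 7.2, 15.5, 142.88], dtype='float32')
freq = np.array([1, 1, 1, 1, 2])
adj = [np.array([8]), np.array([1]), np.array([7]), np.array([0]), np.array([2, 3])]

# To the repeated-values form: one (value, row) pair per indexed record
flat_val = np.repeat(keys, freq)
flat_pk = np.concatenate(adj)

# And back: cut the flat row array where each run of equal values ends
offsets = np.cumsum(freq)[:-1]
adj_again = np.split(flat_pk, offsets)

print(flat_pk)  # → [8 1 7 0 2 3]
```

The flat form is two contiguous arrays, which is friendlier to vectorised operations, while the list form gives direct access to all rows of a given key.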

The kind of secondary index described in this solution is better suited to analysing and filtering big data sets that do not fit in memory but are stored on disk in a column-store format. In that case, for a specific set of columns, it is possible to reconstruct a subset of records in an in-memory (column-store) format and even present it on a hypergraph (stay tuned for the next release of TRIADB).
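As a rough in-memory illustration of that reconstruction step, the row positions returned by the index can be used to gather the matching entries from every column of interest. The second column `name` and the threshold of 100 are made up for the example, and real data would be read from disk rather than held in arrays like this.

```python
import numpy as np

# Two columns of the same table; 'name' is a hypothetical second attribute
val = np.array([15.5, 3.75, 142.88, 142.88, np.nan, np.nan, np.nan, 7.2, 2.1],
               dtype='float32')
name = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])

# Sorted index on val, with the NaN tail cut off
order = np.argsort(val, kind='stable')
n_valid = np.count_nonzero(~np.isnan(val))
sorted_val = val[order][:n_valid]

# Filter: rows with val >= 100, found by binary search on the index
lo = np.searchsorted(sorted_val, np.float32(100.0), side='left')
rows = np.sort(order[lo:n_valid])   # back to table order

# Reconstruct the selected records column by column
subset = {'val': val[rows], 'name': name[rows]}
print(subset['name'])  # → ['c' 'd']
```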
