Moving large SQL query to NumPy


Question


I have a very large MySQL query in my web app that looks like this:

query = """SELECT video_tag.video_id, (SUM(user_rating.rating) * video.rating_norm) AS score
    FROM video_tag
    JOIN user_rating ON user_rating.item_id = video_tag.tag_id
    JOIN video ON video.id = video_tag.video_id
    WHERE item_type = 3 AND user_id = 1 AND rating != 0 AND video.website_id = 2
      AND rating_norm > 0 AND video_id NOT IN (1,2,3)
    GROUP BY video_id
    ORDER BY score DESC
    LIMIT 20"""

This query joins three tables (video, video_tag, and user_rating), groups the results, and does some basic math to compute a score for each video. This takes about 2s to run as the tables are large.

Instead of making SQL do all this work, I suspect it would be faster to do this computation using NumPy arrays. The data in 'video' and 'video_tag' is constant, so I could just load those tables into memory once and not have to ping SQL each time.
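
A minimal sketch of that load-once step, assuming a DB-API driver such as MySQLdb (the connection parameters and column lists here are illustrative, not from the original post):

import numpy as NP
import MySQLdb    # assumption: any DB-API 2.0 driver behaves the same way

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='mydb')
cur = conn.cursor()
cur.execute("SELECT id, website_id, rating_norm FROM video")
video = NP.array(cur.fetchall(), dtype=float)        # constant table, fetched once
cur.execute("SELECT video_id, tag_id FROM video_tag")
video_tag = NP.array(cur.fetchall(), dtype=float)    # constant table, fetched once

Only user_rating would still need a query per request.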

However, while I can load these three tables into three separate arrays, I'm having a heck of a time replicating the above query (specifically the JOIN and GROUP BY parts). Does anyone have experience replicating SQL queries with NumPy arrays?

Thanks!

Solution

What makes this exercise awkward is NumPy's single-data-type constraint for arrays. For instance, the GROUP BY operation implicitly requires (at least) one field/column of numeric values (to aggregate/sum) and one field/column to partition or group on.

Of course, NumPy recarrays can represent a 2D array (or SQL table) using a different data type for each column (aka 'field'), but I find these composite arrays cumbersome to work with. So in the code snippets below, I just used the conventional ndarray class to replicate the two SQL operations highlighted in the OP's question.
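
For reference, a minimal sketch of the structured-array alternative mentioned above (the field names and dtypes are illustrative):

import numpy as NP

# one dtype per named field, analogous to an SQL table schema
videos = NP.array([(1, 0.75), (2, 1.10), (3, 0.90)],
                  dtype=[('id', 'i4'), ('rating_norm', 'f4')])
videos['rating_norm']             # column access by field name
videos[videos['id'] == 2]         # row filtering, like a WHERE clause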

To mimic SQL JOIN in NumPy:

First, create two NumPy arrays (A and B), each representing an SQL table. The first column of B holds its primary key; the first column of A holds the matching foreign key.

import numpy as NP
A = NP.random.randint(10, 100, 40).reshape(8, 5)
a = NP.random.randint(1, 3, 8).reshape(8, -1)    # column of foreign keys (1 or 2)
A = NP.column_stack((a, A))

B = NP.random.randint(0, 10, 4).reshape(2, 2)
b = NP.array([1, 2])                             # B's primary keys
B = NP.column_stack((b, B))


Now (attempt to) replicate JOIN using NumPy array objects:

# prepare the array that will hold the 'result set':
AB = NP.column_stack((A, NP.zeros((A.shape[0], B.shape[1] - 1))))

def join(A, B):
    '''
    Returns None; the side effect is populating the 'result set' array AB.
    Pass in A and B, two 2D NumPy arrays representing the two SQL tables to join.
    '''
    k, v = B[:, 0], B[:, 1:]                # B's keys and its value columns
    dx = dict(zip(k, v))                    # key -> value-columns lookup
    for i in range(A.shape[0]):
        AB[i, A.shape[1]:] = dx[A[i, 0]]    # fill in B's values for this row's key
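
The dictionary loop can also be avoided entirely. A vectorized variant (a sketch, not from the original answer; it assumes B's keys are sorted and that every key in A[:, 0] appears in B[:, 0]):

def join_vectorized(A, B):
    '''Return A with B's value columns appended, matched on the first column.'''
    rows = NP.searchsorted(B[:, 0], A[:, 0])    # row of B for each row of A
    return NP.column_stack((A, B[rows, 1:]))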


To mimic SQL GROUP BY in NumPy:

def group_by(AB, col_id):
    '''
    Returns a 2D NumPy array aggregated over the unique values in the column
    given by col_id; pass in a 2D NumPy array and the integer col_id of the
    column holding the values to group by.
    '''
    uv = NP.unique(AB[:, col_id])
    temp = []
    for v in uv:
        ndx = AB[:, col_id] == v                  # rows belonging to this group
        temp.append(NP.sum(AB[ndx, 1:], axis=0))  # sum the non-key columns
                                                  # (assumes the key is column 0)
    temp = NP.vstack(temp)
    uv = uv.reshape(-1, 1)
    return NP.column_stack((uv, temp))
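
The per-group loop can be vectorized the same way. One common idiom (again a sketch, not part of the original answer) builds integer group labels with NP.unique(..., return_inverse=True) and scatter-adds rows with NP.add.at:

def group_by_vectorized(AB, col_id):
    '''Sum every column except col_id within groups defined by that column.'''
    uv, inv = NP.unique(AB[:, col_id], return_inverse=True)
    vals = NP.delete(AB, col_id, axis=1)        # drop the key column
    sums = NP.zeros((uv.size, vals.shape[1]))
    NP.add.at(sums, inv, vals)                  # unbuffered scatter-add per group
    return NP.column_stack((uv, sums))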



For a test case, they return the correct results:

>>> A
  array([[ 1, 92, 50, 67, 51, 75],
         [ 2, 64, 35, 38, 69, 11],
         [ 1, 83, 62, 73, 24, 55],
         [ 2, 54, 71, 38, 15, 73],
         [ 2, 39, 28, 49, 47, 28],
         [ 1, 68, 52, 28, 46, 69],
         [ 2, 82, 98, 24, 97, 98],
         [ 1, 98, 37, 32, 53, 29]])

>>> B
  array([[1, 5, 4],
         [2, 3, 7]])

>>> join(A, B)        # returns None; populates AB in place
>>> AB
  array([[  1.,  92.,  50.,  67.,  51.,  75.,   5.,   4.],
         [  2.,  64.,  35.,  38.,  69.,  11.,   3.,   7.],
         [  1.,  83.,  62.,  73.,  24.,  55.,   5.,   4.],
         [  2.,  54.,  71.,  38.,  15.,  73.,   3.,   7.],
         [  2.,  39.,  28.,  49.,  47.,  28.,   3.,   7.],
         [  1.,  68.,  52.,  28.,  46.,  69.,   5.,   4.],
         [  2.,  82.,  98.,  24.,  97.,  98.,   3.,   7.],
         [  1.,  98.,  37.,  32.,  53.,  29.,   5.,   4.]])

>>> group_by(AB, 0)
  array([[   1.,  341.,  201.,  200.,  174.,  228.,   20.,   16.],
         [   2.,  239.,  232.,  149.,  228.,  210.,   12.,   28.]])
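
To connect this back to the original query: a rough sketch of how the pieces might compose, using the vectorized helpers above. The column layouts are assumptions (not the OP's actual schema), the WHERE filters are taken as already applied, and the join keys are assumed sorted:

# assumed layouts: ratings = (item_id, rating), tags = (tag_id, video_id),
# videos = (video_id, rating_norm)
def top_videos(ratings, tags, videos, limit=20):
    joined = join_vectorized(tags, ratings)             # attach a rating to each tag row
    per_video = group_by_vectorized(joined[:, 1:], 0)   # -> (video_id, sum(rating))
    scored = join_vectorized(per_video, videos)         # attach rating_norm
    score = scored[:, 1] * scored[:, 2]                 # sum(rating) * rating_norm
    top = NP.argsort(score)[::-1][:limit]               # ORDER BY score DESC LIMIT n
    return NP.column_stack((scored[top, 0], score[top]))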
