Why is PostgreSQL array access so much faster in C than in PL/pgSQL?


Question

I have a table schema which includes an int array column, and a custom aggregate function which sums the array contents. In other words, given the following:

CREATE TABLE foo (stuff INT[]);

INSERT INTO foo VALUES ('{1, 2, 3}');
INSERT INTO foo VALUES ('{4, 5, 6}');

I need a "sum" function that would return { 5, 7, 9 }. The PL/pgSQL version, which works correctly, is as follows:

CREATE OR REPLACE FUNCTION array_add(array1 int[], array2 int[]) RETURNS int[] AS $$
DECLARE
    result int[] := ARRAY[]::integer[];
    l int;
BEGIN
  ---
  --- First check if either input is NULL, and return the other if it is
  ---
  IF array1 IS NULL OR array1 = '{}' THEN
    RETURN array2;
  ELSEIF array2 IS NULL OR array2 = '{}' THEN
    RETURN array1;
  END IF;

  l := array_upper(array2, 1);

  SELECT array_agg(array1[i] + array2[i]) FROM generate_series(1, l) i INTO result;

  RETURN result;
END;
$$ LANGUAGE plpgsql;

Coupled with:

CREATE AGGREGATE sum (int[])
(
    sfunc = array_add,
    stype = int[]
);

With a data set of about 150,000 rows, SELECT SUM(stuff) takes over 15 seconds to complete.

I then re-wrote this function in C, as follows:

#include <postgres.h>
#include <fmgr.h>
#include <utils/array.h>
#include <utils/lsyscache.h>  /* declares get_typlenbyvalalign() */

Datum array_add(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(array_add);

/**
 * Returns the sum of two int arrays.
 */
Datum
array_add(PG_FUNCTION_ARGS)
{
  // The formal PostgreSQL array objects:
  ArrayType *array1, *array2;

  // The array element types (should always be INT4OID):
  Oid arrayElementType1, arrayElementType2;

  // The array element type widths (should always be 4):
  int16 arrayElementTypeWidth1, arrayElementTypeWidth2;

  // The array element type "is passed by value" flags (not used, should always be true):
  bool arrayElementTypeByValue1, arrayElementTypeByValue2;

  // The array element type alignment codes (not used):
  char arrayElementTypeAlignmentCode1, arrayElementTypeAlignmentCode2;

  // The array contents, as PostgreSQL "datum" objects:
  Datum *arrayContent1, *arrayContent2;

  // List of "is null" flags for the array contents:
  bool *arrayNullFlags1, *arrayNullFlags2;

  // The size of each array:
  int arrayLength1, arrayLength2;

  Datum* sumContent;
  int i;
  ArrayType* resultArray;


  // Extract the PostgreSQL arrays from the parameters passed to this function call.
  array1 = PG_GETARG_ARRAYTYPE_P(0);
  array2 = PG_GETARG_ARRAYTYPE_P(1);

  // Determine the array element types.
  arrayElementType1 = ARR_ELEMTYPE(array1);
  get_typlenbyvalalign(arrayElementType1, &arrayElementTypeWidth1, &arrayElementTypeByValue1, &arrayElementTypeAlignmentCode1);
  arrayElementType2 = ARR_ELEMTYPE(array2);
  get_typlenbyvalalign(arrayElementType2, &arrayElementTypeWidth2, &arrayElementTypeByValue2, &arrayElementTypeAlignmentCode2);

  // Extract the array contents (as Datum objects).
  deconstruct_array(array1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1,
&arrayContent1, &arrayNullFlags1, &arrayLength1);
  deconstruct_array(array2, arrayElementType2, arrayElementTypeWidth2, arrayElementTypeByValue2, arrayElementTypeAlignmentCode2,
&arrayContent2, &arrayNullFlags2, &arrayLength2);

  // Create a new array of sum results (as Datum objects).
  sumContent = palloc(sizeof(Datum) * arrayLength1);

  // Generate the sums.
  for (i = 0; i < arrayLength1; i++)
  {
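    // Convert each Datum to int32 before adding, then wrap the sum back into a Datum.
    sumContent[i] = Int32GetDatum(DatumGetInt32(arrayContent1[i]) + DatumGetInt32(arrayContent2[i]));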
  }

  // Wrap the sums in a new PostgreSQL array object.
  resultArray = construct_array(sumContent, arrayLength1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1);

  // Return the final PostgreSQL array object.
  PG_RETURN_ARRAYTYPE_P(resultArray);
}

This version takes only 800 ms to complete, which is... much better.

(Converted to a stand-alone extension here: https://github.com/ringerc/scrapcode/tree/master/postgresql/array_sum)

My question is, why is the C version so much faster? I expected an improvement, but 20x seems a bit much. What's going on? Is there something inherently slow about accessing arrays in PL/pgSQL?

I'm running PostgreSQL 9.0.2, on Fedora Core 8 64-bit. The machine is a High-Memory Quadruple Extra-Large EC2 instance.

Solution

Why?

why is the C version so much faster?

A PostgreSQL array is itself a pretty inefficient data structure. It can contain any data type and it's capable of being multi-dimensional, so lots of optimisations are just not possible. However, as you've seen, it's possible to work with the same array much faster in C.

That's because array access in C can avoid a lot of the repeated work involved in PL/PgSQL array access. Just take a look at array_ref in src/backend/utils/adt/arrayfuncs.c. Now look at how it's invoked from ExecEvalArrayRef in src/backend/executor/execQual.c. ExecEvalArrayRef runs for each individual array access from PL/PgSQL, as you can see by attaching gdb to the backend whose pid is returned by select pg_backend_pid(), setting a breakpoint at ExecEvalArrayRef, continuing, and running your function.

More importantly, in PL/PgSQL every statement you execute is run through the query executor machinery. This makes small, cheap statements fairly slow even allowing for the fact that they're pre-prepared. Something like:

a := b + c

is actually executed by PL/PgSQL more like:

SELECT b + c INTO a;

You can observe this if you turn debug levels high enough, attach a debugger and break at a suitable point, or use the auto_explain module with nested statement analysis. To give you an idea of how much overhead this imposes when you're running lots of tiny simple statements (like array accesses), take a look at this example backtrace and my notes on it.

There is also a significant start-up overhead to each PL/PgSQL function invocation. It isn't huge, but it's enough to add up when it's being used as an aggregate.

A faster approach in C

In your case I would probably do it in C, as you have done, but I'd avoid copying the array when called as an aggregate. You can check for whether it's being invoked in aggregate context:

if (AggCheckCallContext(fcinfo, NULL))

and if so, use the original value as a mutable placeholder, modifying it and then returning it instead of allocating a new one. I'll write a demo to verify that this is possible with arrays shortly... (update) or not-so-shortly; I forgot how absolutely horrible working with PostgreSQL arrays in C is. Here we go:

// append to contrib/intarray/_int_op.c

PG_FUNCTION_INFO_V1(add_intarray_cols);
Datum           add_intarray_cols(PG_FUNCTION_ARGS);

Datum
add_intarray_cols(PG_FUNCTION_ARGS)
{
    ArrayType  *a,
           *b;

    int i, n;

    int *da,
        *db;

    if (PG_ARGISNULL(1))
        ereport(ERROR, (errmsg("Second operand must be non-null")));
    b = PG_GETARG_ARRAYTYPE_P(1);
    CHECKARRVALID(b);

    if (AggCheckCallContext(fcinfo, NULL))
    {
        // Called in aggregate context...
        if (PG_ARGISNULL(0))
            // ... for the first time in a run, so the state in the 1st
            // argument is null. Create a state-holder array by copying the
            // second input array and return it.
            PG_RETURN_POINTER(copy_intArrayType(b));
        else
            // ... for a later invocation in the same run, so we'll modify
            // the state array directly.
            a = PG_GETARG_ARRAYTYPE_P(0);
    }
    else 
    {
        // Not in aggregate context
        if (PG_ARGISNULL(0))
            ereport(ERROR, (errmsg("First operand must be non-null")));
        // Copy 'a' for our result. We'll then add 'b' to it.
        a = PG_GETARG_ARRAYTYPE_P_COPY(0);
        CHECKARRVALID(a);
    }

    // This requirement could probably be lifted pretty easily:
    if (ARR_NDIM(a) != 1 || ARR_NDIM(b) != 1)
        ereport(ERROR, (errmsg("One-dimensional arrays are required")));

    // ... as could this by assuming the un-even ends are zero, but it'd be a
    // little ickier.
    n = (ARR_DIMS(a))[0];
    if (n != (ARR_DIMS(b))[0])
        ereport(ERROR, (errmsg("Arrays are of different lengths")));

    da = ARRPTR(a);
    db = ARRPTR(b);
    for (i = 0; i < n; i++)
    {
        // Fails to check for integer overflow. You should add that.
        *da = *da + *db;
        da++;
        db++;
    }

    PG_RETURN_POINTER(a);
}
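
The loop above notes that it does not check for integer overflow. One way that check might look is sketched below as standalone C, without the PostgreSQL headers, so that it compiles on its own. The helper names checked_add_int32 and add_columns_checked are hypothetical and not part of intarray. Inside the backend, an overflow would normally be reported with ereport(ERROR, ...) rather than by returning false, and recent PostgreSQL releases provide pg_add_s32_overflow in common/int.h for the same purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: stores a + b in *result and returns true, or
 * returns false (leaving *result untouched) if the sum would overflow.
 * The range test happens before the addition, so no signed overflow occurs.
 */
static bool
checked_add_int32(int32_t a, int32_t b, int32_t *result)
{
    if ((b > 0 && a > INT32_MAX - b) ||
        (b < 0 && a < INT32_MIN - b))
        return false;
    *result = a + b;
    return true;
}

/*
 * Adds src[i] into dst[i] in place for n elements, mirroring the
 * da/db loop above. Returns false at the first overflow; elements
 * before that position have already been updated.
 */
static bool
add_columns_checked(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
    {
        if (!checked_add_int32(dst[i], src[i], &dst[i]))
            return false;
    }
    return true;
}
```

In add_intarray_cols, the body of the for loop would call the checked helper and raise an error on a false return instead of doing the bare addition.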

Then append this to contrib/intarray/intarray--1.0.sql:

CREATE FUNCTION add_intarray_cols(_int4, _int4) RETURNS _int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE;

CREATE AGGREGATE sum_intarray_cols(_int4) (sfunc = add_intarray_cols, stype=_int4);

(More correctly, you'd create intarray--1.1.sql and intarray--1.0--1.1.sql and update intarray.control. This is just a quick hack.)

Use:

make USE_PGXS=1
make USE_PGXS=1 install

to compile and install.

Now run DROP EXTENSION intarray; (if you already have it installed) and then CREATE EXTENSION intarray;

You'll now have the aggregate function sum_intarray_cols available to you (like your sum(int4[])), as well as the two-operand add_intarray_cols (like your array_add).

By specializing in integer arrays a whole bunch of complexity goes away. A bunch of copying is avoided in the aggregate case, since we can safely modify the "state" array (the first argument) in-place. To keep things consistent, in the case of non-aggregate invocation we get a copy of the first argument so we can still work with it in-place and return it.
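
The in-place state update described above can be illustrated outside PostgreSQL with a small standalone sketch. It is only a model of the pattern: sum_transition is a hypothetical stand-in for the transition function, a NULL state plays the role of PG_ARGISNULL(0) on the first row, and malloc stands in for the backend's memory management.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for an aggregate transition function.
 * On the first row (state == NULL) it creates the state by copying
 * the input. On every later row it adds the input into the existing
 * state and returns the same pointer, so nothing is allocated per row.
 * Returns NULL only if the initial allocation fails.
 * Overflow is not checked here, to keep the sketch short.
 */
static int32_t *
sum_transition(int32_t *state, const int32_t *row, size_t n)
{
    size_t i;

    assert(row != NULL && n > 0);

    if (state == NULL)
    {
        state = malloc(n * sizeof(int32_t));
        if (state == NULL)
            return NULL;
        memcpy(state, row, n * sizeof(int32_t));
        return state;
    }

    for (i = 0; i < n; i++)
        state[i] += row[i];

    return state;
}
```

Feeding the rows {1, 2, 3} and {4, 5, 6} through it in turn, starting from a NULL state, leaves the state holding {5, 7, 9} after a single allocation; the original array_add allocates a fresh result array for every row instead.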

This approach could be generalised to support any data type by using the fmgr cache to look up the add function for the type(s) of interest, etc. I'm not particularly interested in doing that, so if you need it (say, to sum columns of NUMERIC arrays) then ... have fun.

Similarly, if you need to handle dissimilar array lengths, you can probably work out what to do from the above.
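
As a rough starting point for the unequal-length case, here is a standalone sketch that treats the missing trailing elements of the shorter array as zero, as the comment in the code above suggests. The function name add_padded is hypothetical, and the sketch uses plain C arrays and malloc rather than ArrayType. In the real aggregate, the state array would presumably have to be reallocated whenever a longer input arrives, which this sketch does not attempt to show.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns a newly allocated array of length
 * max(na, nb) in which element i is a[i] + b[i], with elements past
 * the end of the shorter input treated as zero. The result length is
 * stored in *nout and the caller frees the result. Returns NULL if
 * both inputs are empty or if allocation fails.
 * Overflow is not checked here, to keep the sketch short.
 */
static int32_t *
add_padded(const int32_t *a, size_t na,
           const int32_t *b, size_t nb,
           size_t *nout)
{
    size_t n = (na > nb) ? na : nb;
    size_t i;
    int32_t *out;

    assert(nout != NULL);
    *nout = n;
    if (n == 0)
        return NULL;

    out = malloc(n * sizeof(int32_t));
    if (out == NULL)
        return NULL;

    for (i = 0; i < n; i++)
    {
        int32_t x = (i < na) ? a[i] : 0;
        int32_t y = (i < nb) ? b[i] : 0;

        out[i] = x + y;
    }
    return out;
}
```

With inputs {1, 2, 3} and {10, 20} this produces {11, 22, 3}.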
