为什么 PostgreSQL 数组访问在 C 中比在 PL/pgSQL 中快得多? [英] Why is PostgreSQL array access so much faster in C than in PL/pgSQL?

查看:34
本文介绍了为什么 PostgreSQL 数组访问在 C 中比在 PL/pgSQL 中快得多?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个表架构,其中包括一个 int 数组列和一个对数组内容求和的自定义聚合函数.换句话说,鉴于以下内容:

CREATE TABLE foo (stuff INT[]);插入 foo 值 ({ 1, 2, 3 });插入 foo 值 ({ 4, 5, 6 });

我需要一个返回 { 5, 7, 9 } 的sum"函数.正常运行的PL/pgSQL版本如下:

创建或替换函数 array_add(array1 int[], array2 int[]) 返回 int[] AS $$宣布结果 int[] := ARRAY[]::integer[];l 国际;开始------ 首先检查一个输入是否为NULL,如果是则返回另一个---如果 array1 为空或 array1 = '{}' THEN返回数组2;ELSEIF array2 为 NULL 或 array2 = '{}' THEN返回数组1;万一;l := array_upper(array2, 1);SELECT array_agg(array1[i] + array2[i]) FROM generate_series(1, l) i INTO result;返回结果;结尾;$$ 语言 plpgsql;

加上:

CREATE AGGREGATE sum (int[])(sfunc = array_add,stype = int[]);

对于大约 150,000 行的数据集,SELECT SUM(stuff) 需要 15 秒以上才能完成.

然后我用C重新编写了这个函数,如下:

#include #include #include 数据array_add(PG_FUNCTION_ARGS);PG_FUNCTION_INFO_V1(array_add);/*** 返回两个 int 数组的总和.*/基准数组添加(PG_FUNCTION_ARGS){//正式的 PostgreSQL 数组对象:数组类型 *array1, *array2;//数组元素类型(应始终为 INT4OID):oid arrayElementType1, arrayElementType2;//数组元素类型宽度(应始终为 4):int16 arrayElementTypeWidth1, arrayElementTypeWidth2;//数组元素类型按值传递"标志(未使用,应始终为真):bool arrayElementTypeByValue1, arrayElementTypeByValue2;//数组元素类型对齐代码(未使用):字符数组元素类型对齐代码1,数组元素类型对齐代码2;//数组内容,作为 PostgreSQL 的数据"对象:数据 *arrayContent1, *arrayContent2;//数组内容的为空"标志列表:bool *arrayNullFlags1, *arrayNullFlags2;//每个数组的大小:int arrayLength1, arrayLength2;数据* sumContent;国际我;ArrayType* resultArray;//从传递给此函数调用的参数中提取 PostgreSQL 数组.数组1 = PG_GETARG_ARRAYTYPE_P(0);array2 = PG_GETARG_ARRAYTYPE_P(1);//确定数组元素类型.arrayElementType1 = ARR_ELEMTYPE(array1);get_typlenbyvalalign(arrayElementType1, &arrayElementTypeWidth1, &arrayElementTypeByValue1, &arrayElementTypeAlignmentCode1);arrayElementType2 = ARR_ELEMTYPE(array2);get_typlenbyvalalign(arrayElementType2, &arrayElementTypeWidth2, &arrayElementTypeByValue2, &arrayElementTypeAlignmentCode2);//提取数组内容(作为 Datum 对象).deconstruct_array(array1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1,&arrayContent1, &arrayNullFlags1, &arrayLength1);deconstruct_array(array2, arrayElementType2, arrayElementTypeWidth2, arrayElementTypeByValue2, arrayElementTypeAlignmentCode2,&arrayContent2, &arrayNullFlags2, &arrayLength2);//创建一个新的求和结果数组(作为 Datum 对象).sumContent = palloc(sizeof(Datum) * arrayLength1);//生成总和.for (i = 0; i 

这个版本只需要 800 毫秒就可以完成,这是....好多了.

(在此处转换为独立扩展:https://github.com/ringerc/scrapcode/tree/master/postgresql/array_sum)

我的问题是,为什么 C 版本的速度要快得多?我期待改进,但 20 倍似乎有点多.这是怎么回事?在 PL/pgSQL 中访问数组是否存在固有的缓慢问题?

我在 Fedora Core 8 64 位上运行 PostgreSQL 9.0.2.该机器是一个高内存四重超大EC2实例.

解决方案

为什么?

<块引用>

为什么 C 版本的速度那么快?

PostgreSQL 数组本身就是一个非常低效的数据结构.它可以包含任何数据类型并且它能够是多维的,所以很多优化是不可能的.但是,正如您所见,在 C 中使用相同的数组可以更快.

那是因为 C 中的数组访问可以避免 PL/PgSQL 数组访问中涉及的大量重复工作.看看src/backend/utils/adt/arrayfuncs.carray_ref.现在看看它是如何从 ExecEvalArrayRef 中的 src/backend/executor/execQual.c 调用的.它针对来自 PL/PgSQL 的每个单独的数组访问运行,正如您可以看到的,将 gdb 附加到从 select pg_backend_pid() 找到的 pid,在 处设置断点ExecEvalArrayRef,继续并运行您的函数.

更重要的是,在 PL/PgSQL 中,您执行的每个语句都通过查询执行器机制运行.这使得小而廉价的语句相当缓慢,即使考虑到它们是预先准备好的.类似的东西:

a := b + c

实际上是由 PL/PgSQL 执行的,更像是:

SELECT b + c INTO a;

如果您将调试级别调得足够高,附加调试器并在合适的点中断,或者使用带有嵌套语句分析的 auto_explain 模块,您就可以观察到这一点.为了让您了解当您运行大量微小的简单语句(如数组访问)时这会带来多少开销,请查看 这个例子回溯和我的笔记.

每个 PL/PgSQL 函数调用也有显着的启动开销.它不是很大,但是当它被用作一个聚合体时,它加起来就足够了.

在 C 中更快的方法

在您的情况下,我可能会像您一样用 C 语言进行操作,但是当作为聚合调用时,我会避免复制数组.您可以检查它是否在聚合上下文中被调用:

if (AggCheckCallContext(fcinfo, NULL))

如果是这样,使用原始值作为可变占位符,修改它然后返回它而不是分配一个新的.我将编写一个演示来验证这是否可以用数组很快......(更新)或者不是很快,我忘记了在 C 中使用 PostgreSQL 数组是多么可怕.我们开始:

//附加到 contrib/intarray/_int_op.cPG_FUNCTION_INFO_V1(add_intarray_cols);数据 add_intarray_cols(PG_FUNCTION_ARGS);基准add_intarray_cols(PG_FUNCTION_ARGS){数组类型 *a,*b;内部 i, n;国际*达,*D b;如果(PG_ARGISNULL(1))ereport(ERROR, (errmsg("第二个操作数必须非空")));b = PG_GETARG_ARRAYTYPE_P(1);校验有效(b);如果(AggCheckCallContext(fcinfo,NULL)){//在聚合上下文中调用...如果(PG_ARGISNULL(0))//... 第一次运行,所以第一次的状态//参数为空.通过复制来创建一个状态持有者数组//第二个输入数组并返回它.PG_RETURN_POINTER(copy_intArrayType(b));别的//... 在同一次运行中稍后调用,所以我们将修改//直接状态数组.a = PG_GETARG_ARRAYTYPE_P(0);}别的{//不在聚合上下文中如果(PG_ARGISNULL(0))ereport(ERROR, (errmsg("第一个操作数必须非空")));//复制 'a' 作为我们的结果.然后我们将添加'b'.a = PG_GETARG_ARRAYTYPE_P_COPY(0);校验有效(a);}//这个要求可能很容易取消:如果 (ARR_NDIM(a) != 1 || ARR_NDIM(b) != 1)ereport(ERROR, (errmsg("需要一维数组")));//... 假设非偶数结束为零,这可能会是一个//有点恶心.n = (ARR_DIMS(a))[0];如果 (n != (ARR_DIMS(b))[0])ereport(ERROR, (errmsg("数组的长度不同")));da = ARRPTR(a);db = ARRPTR(b);for (i = 0; i 

并将其附加到 contrib/intarray/intarray--1.0.sql:

CREATE FUNCTION add_intarray_cols(_int4, _int4) RETURNS _int4AS 'MODULE_PATHNAME'语言 C 不可变;创建聚合 sum_intarray_cols(_int4) (sfunc = add_intarray_cols, stype=_int4);

(更正确的做法是创建 intarray--1.1.sqlintarray--1.0--1.1.sql 并更新 intarray.control.这只是一个快速的技巧.)

使用:

使 USE_PGXS=1使 USE_PGXS=1 安装

编译安装.

现在DROP EXTENSION intarray;(如果你已经有了)和CREATE EXTENSION intarray;.

您现在可以使用聚合函数 sum_intarray_cols(例如您的 sum(int4[]),以及两个操作数 add_intarray_cols(就像你的 array_add).

通过专门研究整数数组,一堆复杂性消失了.在聚合情况下避免了大量复制,因为我们可以安全地就地修改状态"数组(第一个参数).为了保持一致,在非聚合调用的情况下,我们会得到第一个参数的副本,这样我们仍然可以就地使用它并返回它.

通过使用 fmgr 缓存查找感兴趣类型的 add 函数等,这种方法可以推广到支持任何数据类型.我对这样做并不特别感兴趣,所以如果你需要它(例如,对 NUMERIC 数组的列求和)然后......玩得开心.

同样,如果您需要处理不同的数组长度,您可能可以从上述内容中找出要执行的操作.

I have a table schema which includes an int array column, and a custom aggregate function which sums the array contents. In other words, given the following:

CREATE TABLE foo (stuff INT[]);

INSERT INTO foo VALUES ({ 1, 2, 3 });
INSERT INTO foo VALUES ({ 4, 5, 6 });

I need a "sum" function that would return { 5, 7, 9 }. The PL/pgSQL version, which works correctly, is as follows:

CREATE OR REPLACE FUNCTION array_add(array1 int[], array2 int[]) RETURNS int[] AS $$
DECLARE
    result int[] := ARRAY[]::integer[];
    l int;
BEGIN
  ---
  --- First check if either input is NULL, and return the other if it is
  ---
  IF array1 IS NULL OR array1 = '{}' THEN
    RETURN array2;
  ELSEIF array2 IS NULL OR array2 = '{}' THEN
    RETURN array1;
  END IF;

  l := array_upper(array2, 1);

  SELECT array_agg(array1[i] + array2[i]) FROM generate_series(1, l) i INTO result;

  RETURN result;
END;
$$ LANGUAGE plpgsql;

Coupled with:

CREATE AGGREGATE sum (int[])
(
    sfunc = array_add,
    stype = int[]
);

With a data set of about 150,000 rows, SELECT SUM(stuff) takes over 15 seconds to complete.

I then re-wrote this function in C, as follows:

#include <postgres.h>
#include <fmgr.h>
#include <utils/array.h>

Datum array_add(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(array_add);

/**
 * Returns the sum of two int arrays.
 */
Datum
array_add(PG_FUNCTION_ARGS)
{
  // The formal PostgreSQL array objects:
  ArrayType *array1, *array2;

  // The array element types (should always be INT4OID):
  Oid arrayElementType1, arrayElementType2;

  // The array element type widths (should always be 4):
  int16 arrayElementTypeWidth1, arrayElementTypeWidth2;

  // The array element type "is passed by value" flags (not used, should always be true):
  bool arrayElementTypeByValue1, arrayElementTypeByValue2;

  // The array element type alignment codes (not used):
  char arrayElementTypeAlignmentCode1, arrayElementTypeAlignmentCode2;

  // The array contents, as PostgreSQL "datum" objects:
  Datum *arrayContent1, *arrayContent2;

  // List of "is null" flags for the array contents:
  bool *arrayNullFlags1, *arrayNullFlags2;

  // The size of each array:
  int arrayLength1, arrayLength2;

  Datum* sumContent;
  int i;
  ArrayType* resultArray;


  // Extract the PostgreSQL arrays from the parameters passed to this function call.
  array1 = PG_GETARG_ARRAYTYPE_P(0);
  array2 = PG_GETARG_ARRAYTYPE_P(1);

  // Determine the array element types.
  arrayElementType1 = ARR_ELEMTYPE(array1);
  get_typlenbyvalalign(arrayElementType1, &arrayElementTypeWidth1, &arrayElementTypeByValue1, &arrayElementTypeAlignmentCode1);
  arrayElementType2 = ARR_ELEMTYPE(array2);
  get_typlenbyvalalign(arrayElementType2, &arrayElementTypeWidth2, &arrayElementTypeByValue2, &arrayElementTypeAlignmentCode2);

  // Extract the array contents (as Datum objects).
  deconstruct_array(array1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1,
&arrayContent1, &arrayNullFlags1, &arrayLength1);
  deconstruct_array(array2, arrayElementType2, arrayElementTypeWidth2, arrayElementTypeByValue2, arrayElementTypeAlignmentCode2,
&arrayContent2, &arrayNullFlags2, &arrayLength2);

  // Create a new array of sum results (as Datum objects).
  sumContent = palloc(sizeof(Datum) * arrayLength1);

  // Generate the sums.
  for (i = 0; i < arrayLength1; i++)
  {
    sumContent[i] = arrayContent1[i] + arrayContent2[i];
  }

  // Wrap the sums in a new PostgreSQL array object.
  resultArray = construct_array(sumContent, arrayLength1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1);

  // Return the final PostgreSQL array object.
  PG_RETURN_ARRAYTYPE_P(resultArray);
}

This version takes only 800 ms to complete, which is.... much better.

(Converted to a stand-alone extension here: https://github.com/ringerc/scrapcode/tree/master/postgresql/array_sum)

My question is, why is the C version so much faster? I expected an improvement, but 20x seems a bit much. What's going on? Is there something inherently slow about accessing arrays in PL/pgSQL?

I'm running PostgreSQL 9.0.2, on Fedora Core 8 64-bit. The machine is a High-Memory Quadruple Extra-Large EC2 instance.

解决方案

Why?

why is the C version so much faster?

A PostgreSQL array is its self a pretty inefficient data structure. It can contain any data type and it's capable of being multi-dimensional, so lots of optimisations are just not possible. However, as you've seen it's possible to work with the same array much faster in C.

That's because array access in C can avoid a lot of the repeated work involved in PL/PgSQL array access. Just take a look at src/backend/utils/adt/arrayfuncs.c, array_ref. Now look at how it's invoked from src/backend/executor/execQual.c in ExecEvalArrayRef. Which runs for each individual array access from PL/PgSQL, as you can see by attaching gdb to the pid found from select pg_backend_pid(), setting a breakpoint at ExecEvalArrayRef, continuing, and running your function.

More importantly, in PL/PgSQL every statement you execute is run through the query executor machinery. This makes small, cheap statements fairly slow even allowing for the fact that they're pre-prepared. Something like:

a := b + c

is actually executed by PL/PgSQL more like:

SELECT b + c INTO a;

You can observe this if you turn debug levels high enough, attach a debugger and break at a suitable point, or use the auto_explain module with nested statement analysis. To give you an idea of how much overhead this imposes when you're running lots of tiny simple statements (like array accesses), take a look at this example backtrace and my notes on it.

There is also a significant start-up overhead to each PL/PgSQL function invocation. It isn't huge, but it's enough to add up when it's being used as an aggregate.

A faster approach in C

In your case I would probably do it in C, as you have done, but I'd avoid copying the array when called as an aggregate. You can check for whether it's being invoked in aggregate context:

if (AggCheckCallContext(fcinfo, NULL))

and if so, use the original value as a mutable placeholder, modifying it then returning it instead of allocating a new one. I'll write a demo to verify that this is possible with arrays shortly... (update) or not-so-shortly, I forgot how absolute horrible working with PostgreSQL arrays in C is. Here we go:

// append to contrib/intarray/_int_op.c

PG_FUNCTION_INFO_V1(add_intarray_cols);
Datum           add_intarray_cols(PG_FUNCTION_ARGS);

Datum
add_intarray_cols(PG_FUNCTION_ARGS)
{
    ArrayType  *a,
           *b;

    int i, n;

    int *da,
        *db;

    if (PG_ARGISNULL(1))
        ereport(ERROR, (errmsg("Second operand must be non-null")));
    b = PG_GETARG_ARRAYTYPE_P(1);
    CHECKARRVALID(b);

    if (AggCheckCallContext(fcinfo, NULL))
    {
        // Called in aggregate context...
        if (PG_ARGISNULL(0))
            // ... for the first time in a run, so the state in the 1st
            // argument is null. Create a state-holder array by copying the
            // second input array and return it.
            PG_RETURN_POINTER(copy_intArrayType(b));
        else
            // ... for a later invocation in the same run, so we'll modify
            // the state array directly.
            a = PG_GETARG_ARRAYTYPE_P(0);
    }
    else 
    {
        // Not in aggregate context
        if (PG_ARGISNULL(0))
            ereport(ERROR, (errmsg("First operand must be non-null")));
        // Copy 'a' for our result. We'll then add 'b' to it.
        a = PG_GETARG_ARRAYTYPE_P_COPY(0);
        CHECKARRVALID(a);
    }

    // This requirement could probably be lifted pretty easily:
    if (ARR_NDIM(a) != 1 || ARR_NDIM(b) != 1)
        ereport(ERROR, (errmsg("One-dimesional arrays are required")));

    // ... as could this by assuming the un-even ends are zero, but it'd be a
    // little ickier.
    n = (ARR_DIMS(a))[0];
    if (n != (ARR_DIMS(b))[0])
        ereport(ERROR, (errmsg("Arrays are of different lengths")));

    da = ARRPTR(a);
    db = ARRPTR(b);
    for (i = 0; i < n; i++)
    {
            // Fails to check for integer overflow. You should add that.
        *da = *da + *db;
        da++;
        db++;
    }

    PG_RETURN_POINTER(a);
}

and append this to contrib/intarray/intarray--1.0.sql:

CREATE FUNCTION add_intarray_cols(_int4, _int4) RETURNS _int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE;

CREATE AGGREGATE sum_intarray_cols(_int4) (sfunc = add_intarray_cols, stype=_int4);

(more correctly you'd create intarray--1.1.sql and intarray--1.0--1.1.sql and update intarray.control. This is just a quick hack.)

Use:

make USE_PGXS=1
make USE_PGXS=1 install

to compile and install.

Now DROP EXTENSION intarray; (if you already have it) and CREATE EXTENSION intarray;.

You'll now have the aggregate function sum_intarray_cols available to you (like your sum(int4[]), as well as the two-operand add_intarray_cols (like your array_add).

By specializing in integer arrays a whole bunch of complexity goes away. A bunch of copying is avoided in the aggregate case, since we can safely modify the "state" array (the first argument) in-place. To keep things consistent, in the case of non-aggregate invocation we get a copy of the first argument so we can still work with it in-place and return it.

This approach could be generalised to support any data type by using the fmgr cache to look up the add function for the type(s) of interest, etc. I'm not particularly interested in doing that, so if you need it (say, to sum columns of NUMERIC arrays) then ... have fun.

Similarly, if you need to handle dissimilar array lengths, you can probably work out what to do from the above.

这篇关于为什么 PostgreSQL 数组访问在 C 中比在 PL/pgSQL 中快得多?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆