Pivot on Multiple Columns using Tablefunc


Question


Has anyone used tablefunc to pivot on multiple variables as opposed to only using row name? The documentation notes:


The "extra" columns are expected to be the same for all rows with the same row_name value.


I'm not sure how to do this without combining the columns that I want to pivot on (which I highly doubt will give me the speed I need). One possible way to do this would be to make the entity numeric and add it to the localt as milliseconds, but this seems like a shaky way to proceed.
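
A minimal sketch of that idea, assuming a long-format table test with a numeric entity_id column (both names are hypothetical here, not from the schema below):

 SELECT localt + (entity_id * interval '1 millisecond') AS row_name
      , msrmnt, val
 FROM   test   -- hypothetical long-format table with a numeric entity_id
 ORDER  BY 1, 2;

Each (localt, entity) pair would then map to a distinct row_name, at the cost of mangled timestamps.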


I've edited the data used in a response to this question: PostgreSQL Crosstab Query.

 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

Returns:


 Section                   | Attribute | 1 | 0
---------------------------+-----------+---+---
 2012-01-01 00:00:00       |     a     | 1 | 2
 2012-01-02 00:00:00       |     b     | 3 | 4

So, as the documentation states, the extra column (aka 'Attribute') is assumed to be the same for each row name (aka 'Section'). Thus it reports b for the second row, even though 'entity' also has a 'c' value for that 'timeof'.

Desired output:

Section                   | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00       |     a     | 1 | 2
2012-01-02 00:00:00       |     b     | 3 |  
2012-01-02 00:00:00       |     c     |   | 4

Any thoughts or suggestions?


A little more background: I potentially need to do this for billions of rows and I'm testing out storing this data in long and wide formats and seeing if I can use tablefunc to go from long to wide format more efficiently than with regular aggregate functions.
I'll have about 100 measurements made every minute for around 300 entities. Often, we will need to compare the different measurements made for a given second for a given entity, so we will need to go to wide format very often. Also, the measurements made on a particular entity are highly variable.
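(That's roughly 100 × 300 × 60 × 24 ≈ 43 million rows per day, so billions of rows accumulate within a couple of months.)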

I found this resource: http://www.postgresonline.com/journal/categories/24-tablefunc.

Answer

The problem with your query is that b and c share the same timestamp 2012-01-02 00:00:00, and the timestamp column timeof comes first in your query, so b and c are just "extra" columns that fall into the same group 2012-01-02 00:00:00. Only the first (b) is returned, since (quoting the manual):


The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine.
Just reverse the order of the first two columns so that entity becomes the row name, and it works as desired:

SELECT * FROM crosstab(
      'SELECT entity, timeof, status, ct
       FROM   t4
       ORDER  BY 1'
      ,'VALUES (1), (0)')
 AS ct (
    "Attribute" character
   ,"Section" timestamp
   ,"status_1" int
   ,"status_0" int);

Of course, entity has to be unique then.

The source query must supply the columns in this order (a schematic sketch follows the list):
  • row_name first
  • (optional) extra columns next
  • category (as defined by the second parameter) and value last.
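
A schematic source query makes the required shape concrete (all identifiers here are placeholders):

 SELECT row_name   -- 1st: identifies the output row
      , extra_1    -- optional "extra" columns in between
      , extra_2
      , category   -- 2nd-to-last: matched against the 2nd parameter
      , value      -- last: fills the pivoted cells
 FROM   tbl
 ORDER  BY 1;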


Extra columns are filled from the first row of each row_name partition; values from other rows are ignored, and there is only one column per row_name to fill. Typically those would be the same for every row of one row_name, but that's up to you.

For your big test setup, that means localt and entity can ride along as extra columns next to a surrogate row name:

SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
FROM   crosstab(
        'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
              , localt, entity -- additional columns
              , msrmnt, val
         FROM   test
         -- WHERE  ???   -- instead of LIMIT at the end
         ORDER  BY localt, entity, msrmnt
         -- LIMIT ???'   -- instead of LIMIT at the end
     , $$SELECT generate_series(1,5)$$)  -- more?
     AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
            )
LIMIT 1000  -- ??!!

No wonder the queries in your test perform terribly. Your test setup has 14M rows, and you process all of them before throwing most of them away with LIMIT 1000. For a reduced result set, add WHERE conditions or a LIMIT to the source query, as sketched below!
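
For example, a sketch of the query above with the filter pushed into the source query (the ten-minute time range is an illustrative assumption):

 SELECT localt, entity
      , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05
 FROM   crosstab(
         $$SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
                , localt, entity
                , msrmnt, val
           FROM   test
           WHERE  localt >= '2012-01-01 00:00'  -- filter early,
           AND    localt <  '2012-01-01 00:10'  -- before pivoting
           ORDER  BY localt, entity, msrmnt$$
       , $$SELECT generate_series(1,5)$$)
      AS ct (row_name int, localt timestamp, entity int
           , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8
           , msrmnt04 float8, msrmnt05 float8);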

On top of that, the array you were working with is needlessly expensive. I generate a surrogate row name with dense_rank() instead.


db<>fiddle here - with a simpler test setup and fewer rows.
