由于存在不相关的字段而导致交叉表拆分结果 [英] Crosstab splitting results due to presence of unrelated field

查看:105
本文介绍了由于存在不相关的字段而导致交叉表拆分结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在将postgres 9.1与tablefunc:crosstab一起使用

I'm using postgres 9.1 with tablefunc:crosstab

我有一个具有以下结构的表:

I have a table with the following structure:

CREATE TABLE marketdata.instrument_data
(
  dt date NOT NULL,
  instrument text NOT NULL,
  field text NOT NULL,
  value numeric,
  CONSTRAINT instrument_data_pk PRIMARY KEY (dt , instrument , field )
)

这是由每天获取数据的脚本填充的。因此可能看起来像这样:

This is populated by a script that fetches data daily. So it might look like so:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

然后我使用以下交叉表查询来旋转表:

I then use the following crosstab query to pivot the table:

select dt, instrument, vol, oi 
FROM crosstab($$
    select dt, instrument, field, value 
    from marketdata.instrument_data 
    where field = 'PX_VOLUME' or field = 'OPEN_INT'
    $$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);

运行此命令,我得到的结果是:

Running this I get the result:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   | 2  |

问题:
实际运行时表中的数据,我注意到对于某些字段,该函数将结果分为两行:

The problem: When running this with lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

我检查了dt和instrument字段是否相同,并通过将

I checked that the dt and instrument fields were identical and produced a work-around by grouping the ouput of the crosstab.

分析
我发现,输入表中存在另一个条目会导致输出分为两行。如果我输入的内容如下:

Analysis I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M    | PX_VOLUME | 0     |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

我得到:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M    | 0   |    |
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

哪里真的很奇怪...

Where it gets really weird...

如果我手动重新创建上面的输入表,那么输出将如我们期望的那样,合并为一行。

If I recreate the above input table manually then the output is as we would expect, combined into a single row.

如果我运行:

update marketdata.instrument_data 
set instrument = instrument 
where instrument = 'EUX.Bund.2014M'

然后,输出再次如我们期望的那样,这令人惊讶,因为我所做的只是将工具字段设置为自身

Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.

所以我只能得出结论,该外滩条目中存在一些隐藏的字符/编码问题,这会破坏交叉表。

So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.

关于如何确定中断交叉表的条目有什么建议吗?

Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?

编辑:
我在尝试查看任何隐藏字符的原始表:

I ran the following on the raw table to try and see any hidden characters:

select instrument, encode(instrument::bytea, 'escape')  
from marketdata.bloomberg_future_data_temp 
where instrument = 'EUX.Bund.2014M';

然后得到:

| instrument     | encode         |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |


推荐答案

两个问题。

手册:


在实践中,SQL查询应始终指定 ORDER BY 1,2 ,以确保输入行的正确排序,即具有相同的值将row_name 放在一起并在行中正确排序。

In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.

单参数形式为 crosstab() ORDER BY 1,2 是必要的。

With the one-parameter form of crosstab(), ORDER BY 1,2 would be necessary.

手册:


交叉表(文本source_sql,文本category_sql)

source_sql 是一条产生数据源集的SQL语句。

...

此语句必须返回一个 row_name 列,一个类别列,
和一个 value 列。它还可能有一个或多个额外列。
row_name 列必须是第一列。 类别列必须将
依次排列为最后两列。在 row_name
category 之间的任何列均被视为额外。对于所有具有相同 row_name 值的行,期望额外列
都是相同的。

crosstab(text source_sql, text category_sql)
source_sql is a SQL statement that produces the source set of data.
...
This statement must return one row_name column, one category column, and one value column. It may also have one or more "extra" columns. The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

加粗强调。 一个列。似乎您想在两个列上形成组,但并不能按照您的期望进行。

Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.

相关答案:

  • Pivot on Multiple Columns using Tablefunc

解决方案取决于您实际要实现的目标。这不在您的问题中,您默默地假定该函数将实现您希望的功能。

The solution depends on what you actually want to achieve. It's not in your question, you silently assumed the function would do what you hope for.

I猜想您要对两个前导列进行分组:(dt,工具)。您可以使用串联或数组来玩技巧,但这会很慢和/或不可靠。我建议使用窗口函数 rank() dense_rank() 来生成每个所需组的单列唯一值。这非常便宜,因为对行进行排序是主要的成本,并且框架的顺序始终与所需的顺序相同。您可以根据需要在外部查询中删除添加的列:

I guess you want to group on both leading columns: (dt, instrument). You could play tricks with concatenating or arrays, but that would be slow and / or unreliable. I suggest a cleaner and faster approach with a window function rank() or dense_rank() to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:

SELECT dt, instrument, vol, oi
FROM   crosstab(
   $$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
          , dt, instrument, field, value 
     FROM   marketdata.instrument_data 
     WHERE  field IN ('PX_VOLUME', 'OPEN_INT')
     ORDER  BY 1$$
 , $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
   ) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);

更多详细信息:

  • PostgreSQL Crosstab Query

这篇关于由于存在不相关的字段而导致交叉表拆分结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆