由于存在不相关的字段而导致交叉表拆分结果 [英] Crosstab splitting results due to presence of unrelated field
问题描述
我正在将postgres 9.1与tablefunc:crosstab一起使用
I'm using postgres 9.1 with tablefunc:crosstab
我有一个具有以下结构的表:
I have a table with the following structure:
CREATE TABLE marketdata.instrument_data
(
dt date NOT NULL,
instrument text NOT NULL,
field text NOT NULL,
value numeric,
CONSTRAINT instrument_data_pk PRIMARY KEY (dt , instrument , field )
)
这是由每天获取数据的脚本填充的。因此可能看起来像这样:
This is populated by a script that fetches data daily. So it might look like so:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
然后我使用以下交叉表查询来旋转表:
I then use the following crosstab query to pivot the table:
select dt, instrument, vol, oi
FROM crosstab($$
select dt, instrument, field, value
from marketdata.instrument_data
where field = 'PX_VOLUME' or field = 'OPEN_INT'
$$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);
运行此命令,我得到的结果是:
Running this I get the result:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | 2 |
问题:
实际运行时表中的数据,我注意到对于某些字段,该函数将结果分为两行:
The problem: When running this with lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
我检查了dt和instrument字段是否相同,并通过将
I checked that the dt and instrument fields were identical and produced a work-around by grouping the ouput of the crosstab.
分析
我发现,输入表中存在另一个条目会导致输出分为两行。如果我输入的内容如下:
Analysis I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M | PX_VOLUME | 0 |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
我得到:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M | 0 | |
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
哪里真的很奇怪...
Where it gets really weird...
如果我手动重新创建上面的输入表,那么输出将如我们期望的那样,合并为一行。
If I recreate the above input table manually then the output is as we would expect, combined into a single row.
如果我运行:
update marketdata.instrument_data
set instrument = instrument
where instrument = 'EUX.Bund.2014M'
然后,输出再次如我们期望的那样,这令人惊讶,因为我所做的只是将工具字段设置为自身
Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.
所以我只能得出结论,该外滩条目中存在一些隐藏的字符/编码问题,这会破坏交叉表。
So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.
关于如何确定中断交叉表的条目有什么建议吗?
Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?
编辑:
我在尝试查看任何隐藏字符的原始表:
I ran the following on the raw table to try and see any hidden characters:
select instrument, encode(instrument::bytea, 'escape')
from marketdata.bloomberg_future_data_temp
where instrument = 'EUX.Bund.2014M';
然后得到:
| instrument | encode |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |
推荐答案
两个问题。
在实践中,SQL查询应始终指定
ORDER BY 1,2
,以确保输入行的正确排序,即具有相同的值将row_name
放在一起并在行中正确排序。
In practice the SQL query should always specify
ORDER BY 1,2
to ensure that the input rows are properly ordered, that is, values with the samerow_name
are brought together and correctly ordered within the row.
单参数形式为 crosstab()
, ORDER BY 1,2
是必要的。
With the one-parameter form of crosstab()
, ORDER BY 1,2
would be necessary.
交叉表(文本source_sql,文本category_sql)
source_sql
是一条产生数据源集的SQL语句。
...
此语句必须返回一个row_name
列,一个类别
列,
和一个value
列。它还可能有一个或多个额外列。
row_name
列必须是第一列。类别
和值
列必须将
依次排列为最后两列。在row_name
和category
之间的任何列均被视为额外。对于所有具有相同row_name
值的行,期望额外列
都是相同的。
crosstab(text source_sql, text category_sql)
source_sql
is a SQL statement that produces the source set of data.
...
This statement must return onerow_name
column, onecategory
column, and onevalue
column. It may also have one or more "extra" columns. Therow_name
column must be first. Thecategory
andvalue
columns must be the last two columns, in that order. Any columns betweenrow_name
andcategory
are treated as "extra". The "extra" columns are expected to be the same for all rows with the samerow_name
value.
加粗强调。 一个列。似乎您想在两个列上形成组,但并不能按照您的期望进行。
Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.
相关答案:
- Pivot on Multiple Columns using Tablefunc
解决方案取决于您实际要实现的目标。这不在您的问题中,您默默地假定该函数将实现您希望的功能。
The solution depends on what you actually want to achieve. It's not in your question, you silently assumed the function would do what you hope for.
I猜想您要对两个前导列进行分组:(dt,工具)
。您可以使用串联或数组来玩技巧,但这会很慢和/或不可靠。我建议使用窗口函数 rank()
或 dense_rank()
来生成每个所需组的单列唯一值。这非常便宜,因为对行进行排序是主要的成本,并且框架的顺序始终与所需的顺序相同。您可以根据需要在外部查询中删除添加的列:
I guess you want to group on both leading columns: (dt, instrument)
. You could play tricks with concatenating or arrays, but that would be slow and / or unreliable. I suggest a cleaner and faster approach with a window function rank()
or dense_rank()
to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:
SELECT dt, instrument, vol, oi
FROM crosstab(
$$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
, dt, instrument, field, value
FROM marketdata.instrument_data
WHERE field IN ('PX_VOLUME', 'OPEN_INT')
ORDER BY 1$$
, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);
更多详细信息:
- PostgreSQL Crosstab Query
这篇关于由于存在不相关的字段而导致交叉表拆分结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!