Excluding duplicate fields in a join
Problem description
I have a dataset I'm doing analysis on. It turns out it can easily be enriched with demographic and community data which vastly improves the analytical results.
In order to do this I'm joining in demographic and community data before doing analysis. I need to exclude some fields from my core sample set, so my join looks something like this:
select sampledata.c1,
       sampledata.c2,
       demographics.*,
       community.*
from sampledata
join demographics using (zip)
join community using (fips)
This gets me multiple zip or fips columns in the output which my analysis engine can't deal with. I can't specify each field by hand - the enrichment tables result in hundreds of columns in the end.
I could do select *, but then I'd have all the columns from my sample data which I don't want.
How can I join in my enrichment data without duplicating fields, whilst still selecting the columns I want from my sample table?
One thought I had was that if postgres (my database) could fully qualify each column in the output (like sampledata.c1, demographics.c1, etc.), I would be perfectly happy with that.
There is no column exclusion syntax in SQL; there is only column inclusion syntax (either * for all columns, or an explicit list of column names).
Generate list of only columns you want
However, you could generate the SQL statement with its hundreds of column names, minus the few duplicate columns you do not want, using schema tables and some built-in functions of your database.
-- Builds the text of a SELECT statement; it does not run it.
SELECT
    'SELECT sampledata.c1, sampledata.c2, ' || ARRAY_TO_STRING(ARRAY(
        -- every demographics column except the join key
        SELECT 'demographics' || '.' || column_name
        FROM information_schema.columns
        WHERE table_name = 'demographics'
          AND column_name NOT IN ('zip')
        UNION ALL
        -- every community column except the join key
        SELECT 'community' || '.' || column_name
        FROM information_schema.columns
        WHERE table_name = 'community'
          AND column_name NOT IN ('fips')
    ), ',') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
    AS statement;
This only prints the statement; it does not execute it. You then copy the result and run it.
If you want to both generate and run the statement dynamically in one go, then you may read up on how to run dynamic SQL in the PostgreSQL documentation.
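As a rough illustration of the "generate and run in one go" route, the sketch below wraps the same idea in an anonymous PL/pgSQL block and stores the result as a view. The view name enriched_sample and the schema filter 'public' are assumptions made for the example, not part of the original question, and the block assumes you are allowed to create views. It uses string_agg and format with %I instead of ARRAY_TO_STRING so that column names are quoted when they need to be.

```sql
-- Sketch only: "enriched_sample" and the schema 'public' are hypothetical choices.
DO $$
DECLARE
    col_list text;
BEGIN
    -- Build "demographics.col, ..., community.col, ..." minus the join keys,
    -- quoting identifiers with %I and keeping each table's own column order.
    SELECT string_agg(format('%I.%I', table_name, column_name), ', '
                      ORDER BY table_name DESC, ordinal_position)
    INTO col_list
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND (   (table_name = 'demographics' AND column_name <> 'zip')
           OR (table_name = 'community'    AND column_name <> 'fips'));

    -- Recreate the view from scratch so a changed column list is picked up.
    EXECUTE 'DROP VIEW IF EXISTS enriched_sample';
    EXECUTE 'CREATE VIEW enriched_sample AS '
         || 'SELECT sampledata.c1, sampledata.c2, ' || col_list
         || ' FROM sampledata'
         || ' JOIN demographics USING (zip)'
         || ' JOIN community USING (fips)';
END
$$;

-- The analysis engine can then read the view like a table.
SELECT * FROM enriched_sample;
```

A view must have distinct column names, so if any duplicates remain (for example a c1 column in demographics), CREATE VIEW fails with an error instead of silently producing an ambiguous result.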
Prepend column names with table name
Alternatively, this generates a select list of all the columns, including those with duplicate data, but aliases each one so that the alias includes the column's table name.
-- Builds the text of a SELECT statement; it does not run it.
SELECT
    'SELECT ' || ARRAY_TO_STRING(ARRAY(
        -- e.g. demographics.zip AS demographics_zip
        SELECT table_name || '.' || column_name || ' AS ' || table_name || '_' || column_name
        FROM information_schema.columns
        WHERE table_name IN ('sampledata', 'demographics', 'community')
    ), ',') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
    AS statement;
Again, this only generates the statement. If you want to both generate and run the statement dynamically, you'll need to brush up on dynamic SQL execution for your database; otherwise, just copy and run the result.
If you really want a dot separator in the column aliases, then you'll have to use double-quoted aliases such as SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"'. However, double-quoted aliases can cause extra complications (case-sensitivity, etc.), so I used the underscore character to separate the table name from the column name within the alias. The aliases can then be treated like regular column names.
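If the dotted aliases are wanted anyway, one way to avoid hand-assembling the double quotes is to let format with %I do the quoting. This is a minimal sketch rather than a drop-in replacement: it assumes the three tables live in the public schema, and it orders the columns by table name and position, which the query above does not do.

```sql
-- Sketch only: the schema filter 'public' is an assumption.
-- %I quotes an identifier only when it needs quoting, so an alias containing
-- a dot comes out double-quoted while plain table and column names do not.
SELECT 'SELECT '
    || string_agg(
           format('%I.%I AS %I',
                  table_name, column_name,
                  table_name || '.' || column_name),
           ', ' ORDER BY table_name, ordinal_position)
    || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
    AS statement
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name IN ('sampledata', 'demographics', 'community');
```

Each entry in the generated list then looks like demographics.zip AS "demographics.zip". Whatever consumes the result has to refer to such a column with the double quotes included.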