Excluding duplicate fields in a join


Question

I have a dataset I'm doing analysis on. It turns out it can easily be enriched with demographic and community data which vastly improves the analytical results.

In order to do this I'm joining in demographic and community data before doing analysis. I need to exclude some fields from my core sample set, so my join looks something like this:

select sampledata.c1,
       sampledata.c2,
       demographics.*,
       community.*
from sampledata
    join demographics using (zip)
    join community using (fips)

This gets me multiple zip or fips columns in the output which my analysis engine can't deal with. I can't specify each field by hand - the enrichment tables result in hundreds of columns in the end.

I could do select *, but then I'd have all the columns from my sample data which I don't want.

How can I join in my enrichment data without duplicating fields, whilst still selecting the columns I want from my sample table?

One thought I had: if PostgreSQL (my database) could fully qualify each column in the output (like sampledata.c1, demographics.c1, etc.), I would be perfectly happy with that.

Solution

There is no column exclusion syntax in SQL; there is only column inclusion syntax (the * wildcard for all columns, or an explicit list of column names).

Generate list of only columns you want

However, you could generate the SQL statement with its hundreds of column names, minus the few duplicate columns you do not want, using schema tables and some built-in functions of your database.

SELECT
    'SELECT sampledata.c1, sampledata.c2, ' || ARRAY_TO_STRING(ARRAY(
        -- every demographics column except the join key
        SELECT 'demographics.' || quote_ident(column_name)
        FROM information_schema.columns
        WHERE table_name = 'demographics'
        AND column_name NOT IN ('zip')
        UNION ALL
        -- every community column except the join key
        SELECT 'community.' || quote_ident(column_name)
        FROM information_schema.columns
        WHERE table_name = 'community'
        AND column_name NOT IN ('fips')
    ), ', ') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
AS statement;

This only prints the statement; it does not execute it. Copy the result and run it.

If you want to both generate and run the statement dynamically in one go, then you may read up on how to run dynamic SQL in the PostgreSQL documentation.
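
As a rough sketch of the one-step variant, a PL/pgSQL DO block can build the column list and run the resulting statement with EXECUTE. Several things here are assumptions rather than part of the original answer: the tables are taken to live in the public schema, the result is exposed as a view, and the view name enriched_sample is made up for illustration.

```sql
DO $$
DECLARE
    col_list text;
BEGIN
    -- Collect every enrichment column except the join keys.
    -- format('%I.%I', ...) quotes identifiers only when they need it.
    SELECT string_agg(format('%I.%I', table_name, column_name), ', '
                      ORDER BY table_name DESC, ordinal_position)
    INTO col_list
    FROM information_schema.columns
    WHERE table_schema = 'public'   -- assumption: tables are in "public"
      AND ((table_name = 'demographics' AND column_name <> 'zip')
        OR (table_name = 'community'    AND column_name <> 'fips'));

    -- Recreate the (hypothetical) view so it tracks schema changes.
    DROP VIEW IF EXISTS enriched_sample;
    EXECUTE 'CREATE VIEW enriched_sample AS '
         || 'SELECT sampledata.c1, sampledata.c2, ' || col_list
         || ' FROM sampledata JOIN demographics USING (zip)'
         || ' JOIN community USING (fips)';
END
$$;

-- Afterwards the analysis can read from the view:
-- SELECT * FROM enriched_sample;
```

A view requires unique output column names, so this fails if demographics and community share any column name besides the excluded keys, or if either contains a column named c1 or c2. In that case the table-prefixed aliases described in the next section are the safer choice.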

Prepend column names with table name

Alternatively, this generates a select list of all the columns, including those with duplicate data, but aliases each one to include its table name.

SELECT
    'SELECT ' || ARRAY_TO_STRING(ARRAY(
        -- alias each column as <table>_<column> so every output name is unique
        SELECT table_name || '.' || quote_ident(column_name)
               || ' AS ' || quote_ident(table_name || '_' || column_name)
        FROM information_schema.columns
        WHERE table_name IN ('sampledata', 'demographics', 'community')
    ), ', ') || ' FROM sampledata JOIN demographics USING (zip) JOIN community USING (fips)'
AS statement;

Again, this only generates the statement. If you want to both generate and run the statement dynamically, then you'll need to brush up on dynamic SQL execution for your database, otherwise just copy and run the result.

If you really want a dot separator in the column aliases, you'll have to use double-quoted aliases, such as SELECT table_name || '.' || column_name || ' AS "' || table_name || '.' || column_name || '"'. However, double-quoted aliases can cause extra complications (case sensitivity, etc.), so I used an underscore instead to separate the table name from the column name within the alias. The aliases can then be treated like regular column names.
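
To make the trade-off concrete, here is a small illustration of how the two alias styles behave when they are referenced later. The subqueries and the literal value are made up purely for demonstration.

```sql
-- A dotted alias must be double-quoted, with exact case, wherever it is used.
SELECT t."demographics.zip"
FROM (SELECT '12345' AS "demographics.zip") AS t;

-- An underscore alias behaves like any ordinary identifier:
-- unquoted references are folded to lower case, so this also resolves.
SELECT t.Demographics_Zip
FROM (SELECT '12345' AS demographics_zip) AS t;
```

Writing t.demographics.zip without the quotes in the first query would be parsed as a qualified name rather than as the alias, which is the kind of complication the underscore form avoids.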
