如何使用Unix Join获得外部联接中的所有字段? [英] How to get all fields in outer join with Unix join?

查看:116
本文介绍了如何使用Unix Join获得外部联接中的所有字段?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有两个文件en.csvsp.csv,每个文件都恰好包含两个逗号分隔的记录:

Suppose that I have two files, en.csv and sp.csv, each containing exactly two comma-separated records:

en.csv:

1,dog,red,car
3,cat,white,boat

sp.csv:

2,conejo,gris,tren
3,gato,blanco,bote

如果我执行

join -t, -a 1 -a 2 -e MISSING en.csv sp.csv

我得到的输出是:

1,dog,red,car
2,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote

请注意,所有缺少的字段均已折叠.要获得适当的"完全外部联接,我需要指定一种格式;因此

Notice that all the missing fields have been collapsed. To get a "proper" full outer join, I need to specify a format; thus

join -t, -a 1 -a 2 -e MISSING -o 0,1.2,1.3,1.4,2.2,2.3,2.4 en.csv sp.csv

收益

1,dog,red,car,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote

这种产生完整外部联接的方式的缺点是,需要显式指定最终表的格式,这在编程应用程序中可能不容易做到(其中联接表的标识仅在以下位置是已知的)运行时).

One drawback of this way to produce a full outer join is that one needs to explicitly specify the format of the final table, which may not be easy to do in programmatic applications (where the identity of the joined tables is known only at runtime).

GNU join的最新版本通过支持特殊格式auto消除了此缺点.因此,使用这样的join版本,可以用更通用的

Recent versions of GNU join eliminate this shortcoming by supporting the special format auto. Therefore, with such a version of join the last command above could be replaced by the far more general

join -t, -a 1 -a 2 -e MISSING -o auto en.csv sp.csv

如何使用不支持-o auto选项的join版本达到相同的效果?

How can I achieve this same effect with versions of join that do not support the -o auto option?

背景和详细信息

我有一个Unix shell(zsh)脚本,该脚本旨在处理多个CSV平面文件,并且通过使用GNU join的'-o auto'选项广泛地使用 来实现.我需要修改此脚本,以便它可以在可用的join命令不支持-o auto选项的环境中工作(BSD join以及旧版本的GNU join就是这种情况)

I have a Unix shell (zsh) script that is designed to processes several CSV flatfiles, and does so by making extensive use of GNU join's '-o auto' option. I need to modify this script so that it can work in environments where the available join command does not support the -o auto option (as is the case for BSD join as well as for older versions of GNU join).

此选项在脚本中的典型用法如下:

A typical use of this option in the script is something like:

_reccut () {
    cols="1,$1"
    shift
    in=$1
    shift
    if (( $# > 0 )); then
        join -t, -a 1 -a 2 -e 'MISSING' -o auto \
          <( cut -d, -f $cols $in | sort -t, -k1 ) \
          <( _reccut "$@" )
    else
        cut -d, -f $cols $in | sort -t, -k1
    fi
}

我显示此示例以说明很难用显式格式替换-o auto,因为要在运行时才知道以这种格式包含的字段.

I show this example to illustrate that it would be difficult to replace -o auto with an explicit format, since the fields to include in this format are not known until runtime.

上面的函数_reccut基本上从文件中提取列,并将结果表沿其第一列联接.要查看_reccut的运行情况,请想象一下,除了上面提到的文件之外,我们还拥有文件​​

The function _reccut above basically extracts columns from files, and joins the resulting tables along their first column. To see how _reccut in action, imagine that, in addition to the files mentioned above, we also had the file

de.csv

2,Kaninchen,Grau,Zug
1,Hund,Rot,Auto

然后,例如,要同时显示en.csv的第3列,sp.csv的第2和4列以及de.csv的第3列,将运行:

Then, for example, to display side-by-side column 3 of en.csv, columns 2 and 4 of sp.csv, and column 3 of de.csv one would run:

% _reccut 3 en.csv 2,4 sp.csv 3 de.csv | cut -d, 2-
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING

推荐答案

以下是一种可能适用于您的数据的解决方案.它通过按行号对齐csv文件中的记录来解决该问题,即记录2最终出现在行2上,记录3123出现在行号3123上,依此类推.缺少的记录/行用MISSING字段填充,因此输入文件将被修改为如下所示:

Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2 ends up on line 2, record 3123 on line number 3123 and so on. Missing records/lines are padded with MISSING fields, so the input files would be mangled to look like this:

en.csv:

1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat

de.csv:

1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING

sp.csv:

1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote

从那里很容易切出感兴趣的列,并使用paste并排打印.

From there it is easy to cut out the columns of interest and just print them side-by-side using paste.

要实现此目的,我们首先对输入文件进行 排序,然后应用一些愚蠢的awk魔术:

To achieve this, we sort the input files first and then apply some stupid awk magic:

  • 如果一条记录出现在其预期的行号上,请打印
  • 否则,打印尽可能多的行,其中包含期望的行数(这基于文件中第一行的字段数,与join -o auto的行为相同)MISSING字段,直到再次正确对齐为止
  • 并非所有输入文件都将具有相同数量的记录,因此将在所有文件之前搜索最大值.然后,打印更多具有MISSING字段的行,直到达到最大值为止.
  • If a record appears on their expected line number, print it
  • Otherwise, print as many lines containing the number of expected (this is based on the number of fields of the first line in the file, same as what join -o auto does) MISSING fields until the alignment is correct again
  • Not all input files are going to the same number of records, so the maximum is searched for before all of this. Then, more lines with MISSING fields are printed until the maximum is hit.

reccut.sh:

#!/bin/bash

get_max_recnum()
{
    awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}

align_by_recnum()
{
    sort -t, -k1 "$1" \
        | awk -F, -v MAXREC="$2" '
            NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
            {
                i = NR
                if (NR < $1)
                {
                    while (i < $1)
                    {
                        print i++ missing
                    }
                    NR+=i
                }
            }1
            END { for(i++; i <= MAXREC; i++) { print i missing } }
            '
}

_reccut()
{
    local infiles=()
    local args=( $@ )
    for arg; do
        infiles+=( "$2" )
        shift 2
    done
    MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}

__reccut()
{
    local cols="$1"
    local infile="$2"
    shift 2

    if (( $# > 0 )); then
        paste -d, \
            <(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
            <(__reccut "$@")
    else
        align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
    fi
}

_reccut "$@"

运行

$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING

这篇关于如何使用Unix Join获得外部联接中的所有字段?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆