如何使用Unix Join获得外部联接中的所有字段? [英] How to get all fields in outer join with Unix join?
问题描述
假设我有两个文件en.csv
和sp.csv
,每个文件都恰好包含两个逗号分隔的记录:
Suppose that I have two files, en.csv
and sp.csv
, each containing exactly two comma-separated records:
en.csv
:
1,dog,red,car
3,cat,white,boat
sp.csv
:
2,conejo,gris,tren
3,gato,blanco,bote
如果我执行
join -t, -a 1 -a 2 -e MISSING en.csv sp.csv
我得到的输出是:
1,dog,red,car
2,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote
请注意,所有缺少的字段均已折叠.要获得适当的"完全外部联接,我需要指定一种格式;因此
Notice that all the missing fields have been collapsed. To get a "proper" full outer join, I need to specify a format; thus
join -t, -a 1 -a 2 -e MISSING -o 0,1.2,1.3,1.4,2.2,2.3,2.4 en.csv sp.csv
收益
1,dog,red,car,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote
这种产生完整外部联接的方式的缺点是,需要显式指定最终表的格式,这在编程应用程序中可能不容易做到(其中联接表的标识仅在以下位置是已知的)运行时).
One drawback of this way to produce a full outer join is that one needs to explicitly specify the format of the final table, which may not be easy to do in programmatic applications (where the identity of the joined tables is known only at runtime).
GNU join
的最新版本通过支持特殊格式auto
消除了此缺点.因此,使用这样的join
版本,可以用更通用的
Recent versions of GNU join
eliminate this shortcoming by supporting the special format auto
. Therefore, with such a version of join
the last command above could be replaced by the far more general
join -t, -a 1 -a 2 -e MISSING -o auto en.csv sp.csv
如何使用不支持-o auto
选项的join
版本达到相同的效果?
How can I achieve this same effect with versions of join
that do not support the -o auto
option?
背景和详细信息
我有一个Unix shell(zsh)脚本,该脚本旨在处理多个CSV平面文件,并且通过使用GNU join
的'-o auto'选项广泛地使用 来实现.我需要修改此脚本,以便它可以在可用的join
命令不支持-o auto
选项的环境中工作(BSD join
以及旧版本的GNU join
就是这种情况)
I have a Unix shell (zsh) script that is designed to processes several CSV flatfiles, and does so by making extensive use of GNU join
's '-o auto' option. I need to modify this script so that it can work in environments where the available join
command does not support the -o auto
option (as is the case for BSD join
as well as for older versions of GNU join
).
此选项在脚本中的典型用法如下:
A typical use of this option in the script is something like:
_reccut () {
cols="1,$1"
shift
in=$1
shift
if (( $# > 0 )); then
join -t, -a 1 -a 2 -e 'MISSING' -o auto \
<( cut -d, -f $cols $in | sort -t, -k1 ) \
<( _reccut "$@" )
else
cut -d, -f $cols $in | sort -t, -k1
fi
}
我显示此示例以说明很难用显式格式替换-o auto
,因为要在运行时才知道以这种格式包含的字段.
I show this example to illustrate that it would be difficult to replace -o auto
with an explicit format, since the fields to include in this format are not known until runtime.
上面的函数_reccut
基本上从文件中提取列,并将结果表沿其第一列联接.要查看_reccut
的运行情况,请想象一下,除了上面提到的文件之外,我们还拥有文件
The function _reccut
above basically extracts columns from files, and joins the resulting tables along their first column. To see how _reccut
in action, imagine that, in addition to the files mentioned above, we also had the file
de.csv
2,Kaninchen,Grau,Zug
1,Hund,Rot,Auto
然后,例如,要同时显示en.csv
的第3列,sp.csv
的第2和4列以及de.csv的第3列,将运行:
Then, for example, to display side-by-side column 3 of en.csv
, columns 2 and 4 of sp.csv
, and column 3 of de.csv one would run:
% _reccut 3 en.csv 2,4 sp.csv 3 de.csv | cut -d, 2-
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING
推荐答案
以下是一种可能适用于您的数据的解决方案.它通过按行号对齐csv文件中的记录来解决该问题,即记录2
最终出现在行2
上,记录3123
出现在行号3123
上,依此类推.缺少的记录/行用MISSING
字段填充,因此输入文件将被修改为如下所示:
Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2
ends up on line 2
, record 3123
on line number 3123
and so on. Missing records/lines are padded with MISSING
fields, so the input files would be mangled to look like this:
en.csv
:
1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat
de.csv
:
1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING
sp.csv
:
1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote
从那里很容易切出感兴趣的列,并使用paste
并排打印.
From there it is easy to cut out the columns of interest and just print them side-by-side using paste
.
要实现此目的,我们首先对输入文件进行 排序,然后应用一些愚蠢的awk
魔术:
To achieve this, we sort the input files first and then apply some stupid awk
magic:
- 如果一条记录出现在其预期的行号上,请打印
- 否则,打印尽可能多的行,其中包含期望的行数(这基于文件中第一行的字段数,与
join -o auto
的行为相同)MISSING
字段,直到再次正确对齐为止 - 并非所有输入文件都将具有相同数量的记录,因此将在所有文件之前搜索最大值.然后,打印更多具有
MISSING
字段的行,直到达到最大值为止.
- If a record appears on their expected line number, print it
- Otherwise, print as many lines containing the number of expected (this is based on the number of fields of the first line in the file, same as what
join -o auto
does)MISSING
fields until the alignment is correct again - Not all input files are going to the same number of records, so the maximum is searched for before all of this. Then, more lines with
MISSING
fields are printed until the maximum is hit.
reccut.sh
:
#!/bin/bash
get_max_recnum()
{
awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}
align_by_recnum()
{
sort -t, -k1 "$1" \
| awk -F, -v MAXREC="$2" '
NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
{
i = NR
if (NR < $1)
{
while (i < $1)
{
print i++ missing
}
NR+=i
}
}1
END { for(i++; i <= MAXREC; i++) { print i missing } }
'
}
_reccut()
{
local infiles=()
local args=( $@ )
for arg; do
infiles+=( "$2" )
shift 2
done
MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}
__reccut()
{
local cols="$1"
local infile="$2"
shift 2
if (( $# > 0 )); then
paste -d, \
<(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
<(__reccut "$@")
else
align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
fi
}
_reccut "$@"
运行
$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING
这篇关于如何使用Unix Join获得外部联接中的所有字段?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!