How to compare 2 lists of ranges in bash?

Problem description

Using bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row in the second file. Here I have each range as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.

file1:

1 4
5 7 
8 11
12 15

file2:

3 4 
8 13 
20 24

Desired output:

3 4 
8 13

My best attempt is:

awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next}; 
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then            
{print $1, $2}}}}}' file1 file2 > output.txt

This returns an empty file.

I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.

Any help is appreciated!

Recommended answer

It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:

declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges

# read ranges in second file, store them in arrays min and max
while read a b; do
    min+=( "$a" );
    max+=( "$b" );
done < file2

# read ranges in first file    
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}" # print range
            unset 'min[i]' 'max[i]'    # performance optimization (quoted to avoid globbing)
        fi
    done
done < file1

This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.

EDIT 1: improved the range overlap test.
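
The overlap test used in the loop, `max[i] >= a && min[i] <= b`, is the usual check for closed intervals: two ranges share a number exactly when each one starts no later than the other ends. A minimal sketch of that test as a standalone function (the name `overlaps` and its argument order are my own, not part of the answer):

```shell
#!/usr/bin/env bash

# overlaps LO1 HI1 LO2 HI2
# Exit status 0 when the closed integer ranges [LO1,HI1] and [LO2,HI2]
# have at least one number in common, non-zero otherwise.
overlaps() {
    local lo1=$1 hi1=$2 lo2=$3 hi2=$4
    (( lo1 <= hi2 && lo2 <= hi1 ))
}

# Ranges taken from the question's sample files.
overlaps 1 4 3 4     && echo "1-4 and 3-4 overlap"
overlaps 12 15 20 24 || echo "12-15 and 20-24 do not overlap"
```

Because the test is symmetric, it does not matter which file a range comes from.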

EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.

EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:

awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
    { for (i in a)
          if ($1 <= b[i] && a[i] <= $2) {
              print a[i], b[i]; delete a[i]; delete b[i];
          } 
    }' file2 file1
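
In the worst case (few or no overlaps), both versions above still compare every range of one file with every range of the other. For very large files, one possible improvement, which is not part of the original answer and which I have not benchmarked, is to sort file1 by range start, record the largest end seen so far, and binary-search that table for each range of file2. The sketch below assumes integer ranges, one whitespace-separated pair per line, and a non-empty file1; the function name `overlapping_ranges` and the awk variable names are hypothetical.

```shell
#!/usr/bin/env bash

# overlapping_ranges RANGES QUERIES
# Print every range of QUERIES that overlaps at least one range of RANGES.
# Assumption: RANGES is non-empty (otherwise NR == FNR misfires).
overlapping_ranges() {
    local ranges=$1 queries=$2
    # awk reads the sorted ranges from stdin ("-") first, then the queries.
    sort -n -k1,1 "$ranges" | awk '
        NR == FNR {
            if (NF >= 2) {
                # Sorted by start: keep the start and the running maximum end.
                n++
                start[n] = $1 + 0
                if (n == 1 || $2 + 0 > best) best = $2 + 0
                maxend[n] = best
            }
            next
        }
        NF >= 2 {
            # Binary search: last sorted range whose start is <= query end.
            lo = 1; hi = n; k = 0
            while (lo <= hi) {
                mid = int((lo + hi) / 2)
                if (start[mid] <= $2 + 0) { k = mid; lo = mid + 1 }
                else                      { hi = mid - 1 }
            }
            # Overlap if one of those ranges ends at or after the query start.
            if (k > 0 && maxend[k] >= $1 + 0) print $1, $2
        }' - "$queries"
}

# Usage (file names as in the question):
#   overlapping_ranges file1 file2 > output.txt
```

Unlike the versions above, this prints the file2 ranges in their original order, and the cost per file2 range is logarithmic in the size of file1 rather than linear.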
