How to get two files having max difference among a series of files


Problem description

I have a sequence of .csv files containing columnar data (5 columns) separated by a space. Filenames have the format 'yyyymmdd.csv'. An example of the file format is given below:

content of 20161201.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value

content of 20161202.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value

content of 20161203.csv

key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value

I want to compare the file with date 'D' to the file with date 'D+1', based on the value column. I am then interested in the two consecutive files that have the maximum number of differing rows. Here, if I compare 20161201.csv with 20161202.csv, only the 2nd row mismatches

(123457 20000 some value and 123457 80000 some value, mismatched because of 20000 != 80000)

Then, if I compare 20161202.csv with 20161203.csv, I get 2 mismatching rows (the 1st and 2nd rows).

Hence, 20161202.csv and 20161203.csv are my target files here.

I am looking for a sequence of bash commands which can do the same.

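PS: The number of rows in each file is fairly large (about 3000), and you can assume that all files share the same year and month (number of files < 30).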

Recommended answer

Without checking whether the filenames respect the date comparison rule (date file vs date+1 file), you could do something like this:

files=()
while IFS= read -r -d '' fn; do files+=("$fn"); done < <(find . -name '201612*.csv' -print0 | sort -z)
# Load all filenames into an array. Null separation ensures that filenames are
# handled correctly even if they contain spaces or other special characters.
# sort -z puts the names in date order, because find does not guarantee any order.

max=0
for ((i=0; i<${#files[@]}-1; i++)); do # iterate through the filenames array
  a="${files[i]}"; b="${files[i+1]}" # compare file1 with file2, file2 with file3, etc. in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") | wc -l)
  echo "comparing $a vs $b - non matching lines=$differences" # just for testing; can be removed
  # when a new maximum is found, keep the names of the two files
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b"
done

echo "max differences found=$max between $ahold and $bhold" # report the max differences and the files involved
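The script above deliberately skips the date check. If a day might be missing from the series, a small guard could be added. The following is only a sketch: the function name `is_next_day` is hypothetical, and it relies on the question's statement that all files share the same year and month, so that D+1 is simply the filename number plus one.

```shell
# Succeeds only when file $2 is dated exactly one day after file $1.
# Assumes names like ./20161201.csv and a single year and month for all
# files (as stated in the question), so D+1 is just the number plus one.
is_next_day() {
  d1=$(basename "$1" .csv)
  d2=$(basename "$2" .csv)
  [ "$((d1 + 1))" -eq "$d2" ]
}

# Possible use inside the loop: only compare pairs of consecutive days.
if is_next_day ./20161201.csv ./20161202.csv; then
  echo "consecutive days, compare this pair"
fi
if ! is_next_day ./20161202.csv ./20161204.csv; then
  echo "gap between files, skip this pair"
fi
```

Inside the loop this could be written as `is_next_day "$a" "$b" || continue`, placed before the line that computes `differences`.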

The core of finding the non-matching lines between two files is grep. You can run the grep manually to check that the results are correct:

grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2) 

grep options:
-v : returns non-matching lines (inverts the match)
-F : fixed-string match, not regex
-w : whole-word match, so that 5000 does not match 50000
-f : loads the patterns from a file, here field 2 of file1; these patterns are then searched for in field 2 of file2
wc -l : counts the lines printed by grep -v, i.e. the non-matching lines
<(cut -d' ' -f2 file2) : we grep only field 2 of file2 rather than the whole file, so that a value from file1/field2 cannot match in some other column of file2
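To see why -w matters, here is a small sketch that runs the same grep with and without it. The temporary file names (`old_values`, `new_values`) are made up for the illustration; they stand in for the value columns that `cut` would extract.

```shell
# Work in a throwaway directory (hypothetical file names).
tmp=$(mktemp -d)
printf '5000\n'  > "$tmp/old_values"   # value column of the older file
printf '50000\n' > "$tmp/new_values"   # value column of the newer file

# Without -w, the pattern 5000 matches inside 50000, so -v prints nothing
# and the changed row goes uncounted.
without_w=$(grep -v -F -f "$tmp/old_values" "$tmp/new_values" | wc -l)

# With -w, 5000 must match as a whole word, so 50000 is reported as different.
with_w=$(grep -v -F -w -f "$tmp/old_values" "$tmp/new_values" | wc -l)

echo "without -w: $((without_w)), with -w: $((with_w))"
rm -rf "$tmp"
```

The first count is 0 and the second is 1, which is the behaviour the -w option is there to guarantee.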

Instead of grep, you can use awk like this:

awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2

This selects the same rows as the grep -v command, but prints the whole line of file2 rather than only the value column:

file1/field2 ($2) is loaded as keys into an array a
lines of file2 whose field 2 ($2) is not in this array (non-matching fields) are printed.

It can also be piped to wc -l to count the non-matching lines, as with grep.
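As a quick check, the awk one-liner can be run on the first two sample files from the question. This is just an illustrative sketch; the temporary directory and the names `file1` and `file2` are placeholders.

```shell
tmp=$(mktemp -d)
# Contents of 20161201.csv and 20161202.csv from the question.
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$tmp/file1"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/file2"

# Rows of file2 whose value (field 2) does not occur in field 2 of file1.
changed=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$tmp/file1" "$tmp/file2")
echo "$changed"   # prints: 123457 80000 some value

# The same command piped to wc -l gives the number of such rows.
count=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$tmp/file1" "$tmp/file2" | wc -l)
echo "non matching lines=$((count))"
rm -rf "$tmp"
```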

So if you prefer to use awk, this line:

differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)

has to be changed to:

differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" | wc -l)

In any case, it seems that you need an array to hold the filenames and then you need a loop to iterate through the files and compare them in pairs.
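One caveat: both the grep and the awk commands above treat the value column as a set, so a row whose value changed to a number that some other row already had in the older file would not be counted. If the comparison has to be strictly per key, a different technique can be swapped in: join the two files on column 1 and compare column 2. The sketch below is an assumption-laden illustration, not the answer's method; the function name `count_changed` is hypothetical, keys that appear in only one file are ignored, and the sample data is recreated in a temporary directory so the block runs on its own.

```shell
# Count keys (column 1) whose value (column 2) differs between two files.
# Keys present in only one of the files are ignored in this sketch.
count_changed() {
  awk 'NR==FNR { old[$1] = $2; next }        # first file: remember value per key
       ($1 in old) && old[$1] != $2 { n++ }  # second file: count changed values
       END { print n + 0 }' "$1" "$2"
}

# Recreate the three sample files from the question.
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$tmp/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$tmp/20161203.csv"

max=0
prev=
for f in "$tmp"/201612*.csv; do   # glob results are sorted, so dates are in order
  if [ -n "$prev" ]; then
    n=$(count_changed "$prev" "$f")
    if [ "$n" -gt "$max" ]; then
      max=$n
      ahold=$(basename "$prev")
      bhold=$(basename "$f")
    fi
  fi
  prev=$f
done

echo "max differences found=$max between $ahold and $bhold"
rm -rf "$tmp"
```

On the sample data this reports 2 differences between 20161202.csv and 20161203.csv, matching the expected result in the question.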
