How to get the two files with the maximum difference among a series of files
Problem description
I have a sequence of .csv files containing columnar data (5 columns) separated by a single space. Filenames follow the format 'yyyymmdd.csv'. An example of the file format is given below:
Content of 20161201.csv:
key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value
Content of 20161202.csv:
key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value
Content of 20161203.csv:
key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value
I want to compare the file with date 'D' to the file with date 'D+1', based on the value column. I am then interested in the two consecutive files that have the maximum number of differing rows. For example, if I compare 20161201.csv with 20161202.csv, only the 2nd row mismatches
(123457 20000 some value and 123457 80000 some value, mismatched because of 20000 != 80000)
Then, if I compare 20161202.csv with 20161203.csv, 2 rows mismatch (the 1st and 2nd rows).
Hence, 20161202.csv and 20161203.csv are my target files here.
I am looking for a sequence of bash commands that can do this.
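PS: The number of lines in each file is fairly large (about 3000), and you can assume that all files share the same year and month (fewer than 30 files).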
Recommended answer
Without checking whether the filenames respect the date comparison rule (date file vs. date+1 file), you could do something like this:
# Load all filenames into an array. Null separation ensures that filenames are
# handled correctly even if they contain spaces or other special characters.
# find does not guarantee any order, so sort -z (GNU sort) puts consecutive dates next to each other.
files=()
while IFS= read -r -d '' fn; do files+=("$fn"); done < <(find . -name '201612*.csv' -print0 | sort -z)
max=0
for ((i=0; i<${#files[@]}-1; i++)); do  # iterate through the filenames array
  a="${files[i]}"; b="${files[i+1]}"    # compare file1 with file2, file2 with file3, etc. - in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") | wc -l)
  echo "comparing $a vs $b - non matching lines=$differences"  # just for testing - can be removed
  # When a new maximum is found, keep the names of the two files
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b"
done
echo "max differences found=$max between $ahold and $bhold"  # report the max differences and the files involved
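The loop above does not verify that two adjacent files are exactly one day apart. A minimal sketch of such a check follows. It relies on the question's statement that all files share the same year and month, and the function name is_next_day is hypothetical, not part of the original answer.

```shell
# Hypothetical helper: succeeds if file $2 is dated exactly one day after file $1.
# Assumes names like ./20161201.csv and that both files share the same year and month.
is_next_day() {
  local d1 d2
  d1=$(basename "$1" .csv)
  d2=$(basename "$2" .csv)
  # Reject names that are not exactly eight digits.
  [[ $d1 =~ ^[0-9]{8}$ && $d2 =~ ^[0-9]{8}$ ]] || return 1
  # Same yyyymm prefix, and the day of month differs by exactly 1.
  # 10# forces base 10 so that days such as 08 and 09 are not read as octal.
  [[ "${d1:0:6}" == "${d2:0:6}" ]] && (( 10#${d2:6:2} - 10#${d1:6:2} == 1 ))
}

# Possible use inside the loop, before computing the differences:
#   is_next_day "$a" "$b" || continue
```

With this check, a pair such as 20161202.csv and 20161204.csv would be skipped instead of being compared as if the files were consecutive.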
The core of getting the non-matching lines between two files is grep. You can run the grep manually to check that the results are correct:
grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)
grep options:
-v : returns the non-matching lines (inverts the match)
-F : fixed-string (non-regex) matching
-w : whole-word matching, so that 5000 does not match 50000
-f : loads the patterns from a file, here field2 of file1. These patterns are then searched for in field2 of file2.
Other parts of the command:
wc -l : counts the lines that grep prints, i.e. the non-matching lines
<(cut -d' ' -f2 file2) : we grep field2 of file2 rather than the whole of file2, to avoid field2 values of file1 matching in columns of file2 other than column 2
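To try the command by hand, the sketch below recreates two of the sample files from the question in a temporary directory and runs the same grep on them. The temporary paths and the variable names (tmp, count, without_w, with_w) are illustrative assumptions. It needs bash for the process substitution.

```shell
# Recreate two of the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$tmp/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"

# Values (column 2) of the second file that do not occur in column 2 of the first file.
grep -v -F -w -f <(cut -d' ' -f2 "$tmp/20161201.csv") <(cut -d' ' -f2 "$tmp/20161202.csv")

# The same command, counted with wc -l.
count=$(grep -v -F -w -f <(cut -d' ' -f2 "$tmp/20161201.csv") <(cut -d' ' -f2 "$tmp/20161202.csv") | wc -l)
echo "non matching lines=$count"

# Effect of -w: without it, the pattern 5000 also matches inside 50000.
# grep -c exits non-zero when the count is 0, hence the "|| true".
without_w=$(echo 50000 | grep -c -F 5000 || true)
with_w=$(echo 50000 | grep -c -F -w 5000 || true)
echo "without -w: $without_w, with -w: $with_w"

rm -rf "$tmp"
```

For these two files the only value of 20161202.csv that is absent from 20161201.csv is 80000, which agrees with the single mismatching row described in the question.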
Instead of grep, you can use awk like this:
awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2
This selects the same non-matching lines as the grep -v command above (awk prints the whole line rather than only field2):
field2 ($2) of file1 is loaded into an array a;
lines of file2 whose field2 ($2) is not in this array (non-matching fields) are printed.
The output can also be piped to |wc -l to count the non-matching lines, as with grep.
So if you prefer to use awk, this line:
differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)
must be changed to:
differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" |wc -l)
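As a quick sanity check, the two variants can be run side by side on the second and third sample files from the question. This is only a sketch: the temporary paths and the names grep_count and awk_count are assumptions made for illustration.

```shell
# Recreate 20161202.csv and 20161203.csv from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$tmp/20161203.csv"
a="$tmp/20161202.csv"; b="$tmp/20161203.csv"

# grep variant: prints only the non-matching values of column 2.
grep_count=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") | wc -l)

# awk variant: prints the whole non-matching lines, so the line count is the same.
awk_count=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" | wc -l)

echo "grep: $grep_count, awk: $awk_count"
rm -rf "$tmp"
```

Both counts should come out as 2 here, matching the two mismatching rows (50000 and 70000) that the question reports for this pair.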
In any case, you need an array to hold the filenames and a loop to iterate through the files and compare them in pairs.
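One caveat worth noting: both commands above treat column 2 of the first file as a set of values, so a value that merely moves to a different key is not counted as a change. If rows should instead be compared key by key, as the question's example suggests, a variant along the following lines may be closer to the intent. It is a sketch that assumes column 1 is a unique key, and count_changed is a hypothetical name.

```shell
# Hypothetical helper: count the keys of file $2 whose value (column 2) differs from file $1.
# Assumes column 1 is a unique key; keys missing from the first file also count as different.
count_changed() {
  awk 'NR==FNR { val[$1] = $2; next }
       !($1 in val) || val[$1] != $2 { n++ }
       END { print n + 0 }' "$1" "$2"
}

# Small illustration: the two values are swapped between the keys.
tmp=$(mktemp -d)
printf '%s\n' 'k1 10000 x' 'k2 20000 x' > "$tmp/d1.csv"
printf '%s\n' 'k1 20000 x' 'k2 10000 x' > "$tmp/d2.csv"

set_based=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$tmp/d1.csv" "$tmp/d2.csv" | wc -l)
key_based=$(count_changed "$tmp/d1.csv" "$tmp/d2.csv")
echo "set-based: $set_based, key-based: $key_based"
rm -rf "$tmp"
```

In the loop, the line that computes the differences would then become differences=$(count_changed "$a" "$b"). On the sample files from the question both approaches give the same counts, so the distinction only matters when values can recur under different keys.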