Print differences between unsorted strings from files


Question

I have two files, each containing n lines with one string per line. I want to print the difference in characters between corresponding lines of the two lists. You can think of the operation as a kind of "subtraction" of letters. This is how it should look:

List1       List2      Result
AaBbCcDd    AaCcDd     Bb
AaBbCcE     AaBbCc     E
AaBbCcF     AaCcF      Bb

This means that the second list is not sorted alphabetically, but the substrings to remove appear in sorted order within each string (Aa comes before Bb, which comes before Cc). Note that the elements to remove can be 1 or 2 characters long (Aa or F): always an uppercase letter, sometimes followed by a lowercase letter. The strings are built entirely from a few "elements" such as Aa, Bb, Cc, Dd, E, F, Gg, and so on.

A very similar question has been answered here: Bash script Find difference between two strings, but only for two strings entered manually, whereas I need to perform the operation many hundreds of times. I am struggling to use files as the input to this command while also separating the characters correctly. Here is my adaptation:

split_chars() { sed $'s/./&\\\n/g' <<< "$1"; }
comm -23 <(split_chars AaBbCcDd) <(split_chars AaCcDd)

gives the output

B
b

which is still not what I want, even for this single pair. I guess that the split_chars function is the key here, but I could not apply it to my files in any way. Putting the file names inside the parentheses obviously does not work. For reference, a simple

comm -23 List1 List2

just results in

AaBbCcDd
AaBbCcE
AaBbCcF
comm: file 2 is not in sorted order

Recommended answer

Since you do not want to split the strings into single characters but into substrings that each start with an uppercase letter, you should replace split_chars with the following function.

split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }

Splitting a line can be undone by deleting all newline characters using tr -d \\n.
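As a quick check of the split and rejoin steps, here is a minimal sketch. It uses a variant of split that pipes through printf instead of a here-string, and rejoin is a hypothetical helper name introduced only for this illustration. It assumes GNU sed, which treats \n in the replacement as a newline, and it sets LC_ALL=C on the assumption that [A-Z] might otherwise match lowercase letters in some locales.

```shell
# Variant of the answer's split() that feeds sed through printf instead of a
# here-string. Assumptions: GNU sed (for \n in the replacement) and LC_ALL=C
# so that [A-Z] matches only uppercase ASCII letters.
split() { printf '%s\n' "$1" | LC_ALL=C sed 's/[A-Z]/\n&/g'; }

# Hypothetical helper name: delete every newline, then emit one at the end.
rejoin() { tr -d '\n'; echo; }

split AaBbCcE            # one element per line, after a leading empty line
split AaBbCcE | rejoin   # back to the original string
```

Because the newline is inserted before each uppercase letter, the split output starts with an empty line. It disappears again once the newlines are deleted.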

To subtract one list of lines from another list of lines, you can use grep, which does not require the input to be sorted.

grep -vFxf subtrahend minuend

This prints, in their original order, those lines from the file minuend that do not occur in the file subtrahend.
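A small sketch of this subtraction on throwaway files may help. The temporary directory and the sample contents are assumptions made for the illustration; only the grep call itself comes from the answer.

```shell
# Work in a throwaway directory; the file names follow the answer's wording.
tmp=$(mktemp -d)
printf '%s\n' Aa Bb Cc Dd > "$tmp/minuend"
printf '%s\n' Dd Aa Cc    > "$tmp/subtrahend"   # deliberately unsorted

# -v  invert the match (keep lines that do NOT match)
# -F  treat patterns as fixed strings, not regular expressions
# -x  a pattern must match the whole line
# -f  read the patterns from a file, one per line
result=$(grep -vFxf "$tmp/subtrahend" "$tmp/minuend")
echo "$result"

rm -r "$tmp"
```

Here only Bb survives. Note that this is a set-style subtraction: if an element occurred twice in minuend and once in subtrahend, both occurrences would be removed.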

To put everything together, you have to

  • read both files in parallel
  • split each string into lines
  • subtract the resulting lists
  • undo the splitting

Here is a simplified version. It assumes that your input files contain only lines of the described format and that both files have the same number of lines.

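# Split a string into elements, one per line (\n in the replacement is a GNU sed feature).
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
# Print the lines of file $1 that do not occur in file $2.
subtract() { grep -vFxf "$2" "$1"; }
# Join the remaining lines back into a single string.
union() { tr -d \\n; echo; }
# Read one string from each file per iteration.
paste List1 List2 | while read -r minuend subtrahend; do
    subtract <(split "$minuend") <(split "$subtrahend") | union
done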

Bash scripts with loops are slow, because each iteration here starts several external processes (sed twice, grep, and tr). If you need a faster solution, you should rewrite this script in a higher-level language such as perl or python.
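The answer names perl or python. As one possible alternative that stays within the shell toolbox, the sketch below does the whole job in a single awk process; awk is swapped in here and is not what the answer proposes. The function name subtract_elements and the temporary sample files are hypothetical, and the sketch assumes the described input format: tab-free strings whose elements each start with an uppercase ASCII letter.

```shell
# Hypothetical helper: $1 = file of minuend strings, $2 = file of subtrahend
# strings. Prints one result per line pair. LC_ALL=C keeps [A-Z] ASCII-only.
subtract_elements() {
    paste "$1" "$2" | LC_ALL=C awk -F'\t' '
    {
        # Mark every element of the subtrahend (column 2).
        split("", drop)                      # portable way to clear an array
        rest = $2
        while (match(rest, /[A-Z][^A-Z]*/)) {
            drop[substr(rest, RSTART, RLENGTH)] = 1
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Walk the minuend (column 1) and keep the unmarked elements.
        out = ""
        rest = $1
        while (match(rest, /[A-Z][^A-Z]*/)) {
            elem = substr(rest, RSTART, RLENGTH)
            if (!(elem in drop)) out = out elem
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

# Demo on the sample data from the question, written to throwaway files.
tmp=$(mktemp -d)
printf '%s\n' AaBbCcDd AaBbCcE AaBbCcF > "$tmp/List1"
printf '%s\n' AaCcDd   AaBbCc  AaCcF   > "$tmp/List2"
result=$(subtract_elements "$tmp/List1" "$tmp/List2")
echo "$result"
rm -r "$tmp"
```

For the three sample pairs this prints Bb, E, and Bb, one per line. Like the grep version, it removes every occurrence of a subtracted element, and it preserves the order of the remaining elements.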
