Unix Bash Remove Duplicate Lines From Directory Files?



I have a directory with a few hundred txt files. I need to remove all duplicate lines from each of the existing files. Every line in the entire directory should be unique regardless of the file it's in, so I need to compare and check each file against the other. Is this possible to do without altering the existing file structure? The file names need to stay the same.

Let's say all the files are in directory "foo" and the total size of the directory is 30 MB.

I think I can do this through comm or awk, but I haven't found a working command line to do this and I'm unfamiliar with the syntax.

UPDATE: I have tried this line, which I believe prints all the duplicates in the shell, but it is not removing the duplicates from the files.

awk 'NR==FNR{a[$0]="";next}; !($0 in a)' tmp/*
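A minimal reproduction of what that one-liner actually does (a.txt and b.txt are hypothetical stand-ins for the tmp/* files): NR==FNR is true only while the first file is read, so its lines fill array a, and for every later file the lines not found in a are printed to the terminal; the files on disk are never touched.

```shell
# Two sample files sharing the line "dup"
printf 'dup\nonly-a\n' > a.txt
printf 'dup\nonly-b\n' > b.txt

# a[] holds a.txt's lines; for b.txt the pattern prints lines NOT in a[].
# Output goes to stdout only -- neither file is modified.
awk 'NR==FNR{a[$0]="";next} !($0 in a)' a.txt b.txt
# prints: only-b
```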

Solution

awk '{
   if(FNR==1){                    # a new input file is starting
       if(fs!=lfn && NR!=1){      # previous file had no distinct lines
         b[lfn]                   # remember it, to be emptied in END
       };
       lfn=FILENAME
   };
   if(!($0 in a)) {               # first time this line is seen anywhere
        a[$0]; print $0>FILENAME; # write the unique line back to the file
        fs=FILENAME               # flag: this file kept at least one line
   }
}
END{
    if(fs!=lfn){                  # the last file may also have no distinct lines
         b[FILENAME]
    };
    for (i in b){                 # truncate files that kept no lines
         close(i);
         printf "" > i
    }
}' tmp/*

1st Condition:

if(!($0 in a)) {
  a[$0];print $0>FILENAME;
  fs=FILENAME
}

Check whether the current line $0 is already in array a: if it is not, add it to a and print it to the current file being read; otherwise ignore the line. The FILENAME awk built-in variable gives the name of the file being read. If at least one distinct line is found in the file currently being read, the flag fs is set to FILENAME.
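This membership test is the heart of the dedup logic. A stripped-down sketch of just that part, printing to stdout instead of rewriting files:

```shell
# Keep only the first occurrence of each line; repeats are dropped.
printf 'x\ny\nx\nz\ny\n' | awk '!($0 in a) { a[$0]; print }'
# prints:
# x
# y
# z
```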

2nd Condition:

  if(FNR==1){
    if(fs!=lfn && NR!=1){
      b[lfn]
    };
     lfn=FILENAME
  }

So when the next file starts being read (FNR==1), fs (the last file that contained a distinct line) is compared with lfn (the last file name); if they differ, an entry with index lfn is created in array b, marking that file to be truncated to an empty file later.
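FNR resets to 1 for each input file, while NR keeps counting across all of them, which is what lets the script detect file boundaries. A minimal sketch (f1 and f2 are hypothetical file names):

```shell
printf 'a\nb\n' > f1
printf 'c\n' > f2

# FNR==1 fires once per file, so each file name is printed as it starts.
awk 'FNR==1 { print "starting", FILENAME }' f1 f2
# prints:
# starting f1
# starting f2
```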

END block:

  END{
      if(fs!=lfn){
           b[FILENAME]
      };
      for (i in b){
           close(i);
           printf "" > i
      }
  }

In the END block, condition 2 above is checked again to find out whether the last file had any distinct lines. The script then loops through array b to truncate to empty every file in which no distinct lines were found. Here I have assumed there is no particular order in which the files are read.

This script is not optimal, but it will do the job.
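As a quick end-to-end check, here is the script run against a small hypothetical tmp/ directory. Note that the script rewrites each file while reading it, which is safe here only because the files are far smaller than awk's input buffer:

```shell
mkdir -p tmp
printf 'alpha\nbeta\n' > tmp/one.txt
printf 'beta\ngamma\n' > tmp/two.txt    # "beta" repeats a line from one.txt
printf 'beta\n'        > tmp/three.txt  # contains no distinct lines at all

awk '{
   if(FNR==1){
       if(fs!=lfn && NR!=1){ b[lfn] };
       lfn=FILENAME
   };
   if(!($0 in a)) { a[$0]; print $0>FILENAME; fs=FILENAME }
}
END{
    if(fs!=lfn){ b[FILENAME] };
    for (i in b){ close(i); printf "" > i }
}' tmp/*

cat tmp/one.txt         # alpha, beta  (unchanged)
cat tmp/two.txt         # gamma        (the duplicate "beta" is gone)
wc -c < tmp/three.txt   # 0 -- the file still exists but is now empty
```

Every line now appears exactly once across the directory, and all three file names are preserved, as the question requires.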
