Shell: Find Matching Lines Across Many Files


Problem description

I am trying to use a shell script (well, a "one-liner") to find any common lines between around 50 files. Edit: Note that I am looking for a line (or lines) that appears in all of the files.

So far I've tried grep: grep -v -x -f file1.sp *, which just matches that file's contents across ALL the other files.

I've also tried grep -v -x -f file1.sp file2.sp | grep -v -x -f - file3.sp | grep -v -x -f - file4.sp | grep -v -x -f - file5.sp etc., but I believe that uses the files to be searched as stdin, not as the pattern to match on.

Does anyone know how to do this with grep or another tool?

I don't mind if it takes a while to run. I've got to add a few lines of code to around 500 files and wanted to find a common line in each of them to insert 'after' (they were originally just copied and pasted from one file, so hopefully there are some common lines!).

Thanks for your time,

Solution

Old bash answer (O(n); opens 2*n files)

Building on @mjgpy3's answer, you just have to put comm in a for loop, like this:

#!/bin/bash

tmp1="/tmp/tmp1$RANDOM"
tmp2="/tmp/tmp2$RANDOM"

# start the running intersection from the first file
cp "$1" "$tmp1"
shift
for file in "$@"
do
    # comm -1 -2 keeps only the lines common to both inputs
    comm -1 -2 "$tmp1" "$file" > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm "$tmp1"

Save it as comm.sh, make it executable, and call

./comm.sh *.sp 

assuming all your filenames end with .sp.
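
One caveat: comm expects both of its inputs to be sorted, so the loop above works as-is only if the .sp files are already in sorted order. If they are not, a variation along the same lines (a sketch, not part of the original answer) can pre-sort each input with process substitution; note that the common lines then come out sorted rather than in their original file order:

#!/bin/bash

tmp1="/tmp/tmp1$RANDOM"
tmp2="/tmp/tmp2$RANDOM"

# start from the first file, sorted so comm's requirements are met
sort "$1" > "$tmp1"
shift
for file in "$@"
do
    # keep only the lines present in both the running result and the next (sorted) file
    comm -1 -2 "$tmp1" <(sort "$file") > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm "$tmp1"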

Updated answer, Python, opens each file only once

Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and supports duplicated lines. Additionally, let's process the files in parallel.

Here you go (in python3):

#!/usr/bin/env python3
import argparse
import sys
import multiprocessing
import os

EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}

def extract_set(filename):
    # read one file and return the set of its lines, stripped of any line ending
    with open(filename, 'rb') as f:
        return set(line.rstrip(b'\r\n') for line in f)

def find_common_lines(filenames):
    # build each file's line set in a worker process, then intersect them all
    pool = multiprocessing.Pool()
    line_sets = pool.map(extract_set, filenames)
    return set.intersection(*line_sets)

if __name__ == '__main__':
    # usage info and argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument("in_files", nargs='+', 
            help="find common lines in these files")
    parser.add_argument('--out', type=argparse.FileType('wb'),
            help="the output file (default stdout)")
    parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
            help="(default: native)")
    args = parser.parse_args()

    # actual stuff
    common_lines = find_common_lines(args.in_files)

    # write results to output
    to_print = EOLS[args.eol_style].join(common_lines)
    if args.out is None:
        # find out stdout's encoding, utf-8 if absent
        encoding = sys.stdout.encoding or 'utf-8'
        sys.stdout.write(to_print.decode(encoding))
    else:
        args.out.write(to_print)

Save it as find_common_lines.py, and call

python ./find_common_lines.py *.sp

More usage info with the --help option.
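
For example, to collect the common lines into a file with Unix line endings, the --out and --eol-style options defined in the script above can be combined like this (the output file name common_lines.txt is just a placeholder):

python ./find_common_lines.py *.sp --out common_lines.txt --eol-style unix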
