Print lines indexed by a second file


Problem description

I have two files:

  1. File with strings (new line terminated)
  2. File with integers (one per line)

I would like to print the lines from the first file that are indexed by the lines in the second file. My current solution is this:

while read -r index
do
    sed -n "${index}p" "$file1"
done < "$file2"

It reads the index file line by line and runs sed once per index to print that specific line. The problem is that this is slow for large index files (thousands to tens of thousands of lines).

Is it possible to do this faster? I suspect awk can be useful here.

I searched SO as best I could, but could only find people trying to print line ranges rather than indexing by a second file.

UPDATE

The index is generally not shuffled. The lines are expected to appear in the order defined by the indices in the index file.

EXAMPLE

File 1:

this is line 1
this is line 2
this is line 3
this is line 4

File 2:

3
2

The expected output is:

this is line 3
this is line 2

Solution

If I understand you correctly, then

awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile

should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:

NR == FNR {         # while processing the first file
  selected[$1] = 1  # remember if an index was seen
  next              # and do nothing else
}
selected[FNR]       # after that, select (print) the selected lines.
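To make the ordering assumption concrete, here is a small sketch that runs the one-liner on the sample files from the question. The scratch directory and the file names indexfile and datafile are only illustrative assumptions.

```shell
# Recreate the sample files from the question in a scratch directory
# (directory and file names are illustrative, not part of the question).
tmpdir=$(mktemp -d)
printf '%s\n' 'this is line 1' 'this is line 2' 'this is line 3' 'this is line 4' > "$tmpdir/datafile"
printf '%s\n' 3 2 > "$tmpdir/indexfile"

# First file (NR == FNR): record every requested line number.
# Second file: print each line whose line number was recorded.
result=$(awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' \
    "$tmpdir/indexfile" "$tmpdir/datafile")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

The index asks for line 3 before line 2, yet the output lists line 2 first and line 3 second, because lines are printed as they are met in the data file.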

If the index is not sorted and the lines should be printed in the order in which they appear in the index:

NR == FNR {               # processing the index:
  ++counter
  idx[$0] = counter       # remember that and at which position you saw
  next                    # the index
}
FNR in idx {              # when processing the data file: 
  lines[idx[FNR]] = $0    # remember selected lines by the position of
}                         # the index
END {                     # and at the end: print them in that order.
  for(i = 1; i <= counter; ++i) {
    print lines[i]
  }
}

This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file

1
4
3

and a data file

line1
line2
line3
line4

this will print

line1
line4
line3
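For reference, an inlined form of the same script might look like the sketch below, run against the index and data files from this example. The scratch directory is an assumption made so the snippet is self-contained.

```shell
# Build the example files in a scratch directory (illustrative paths).
tmpdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 line4 > "$tmpdir/datafile"
printf '%s\n' 1 4 3 > "$tmpdir/indexfile"

# Same logic as the commented script, with statements separated by semicolons:
# idx maps a line number to its position in the index file, and lines
# collects the selected data lines keyed by that position.
result=$(awk 'NR == FNR { ++counter; idx[$0] = counter; next }
     FNR in idx { lines[idx[FNR]] = $0 }
     END { for (i = 1; i <= counter; ++i) print lines[i] }' \
    "$tmpdir/indexfile" "$tmpdir/datafile")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Whether the inline form or a separate foo.awk is easier to maintain is a matter of taste; the logic is the same in both.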

The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:

NR == FNR {                      # processing the index:
  ++counter
  idx[$0] = idx[$0] " " counter  # remember a list of positions here
  next
}
FNR in idx {                     # when processing the data file:
  split(idx[FNR], pos)           # split that list
  for(p in pos) {
    lines[pos[p]] = $0           # and remember the line for
                                 # every position in it.
  }
}
END {
  for(i = 1; i <= counter; ++i) {
    print lines[i]
  }
}

This, finally, is the functional equivalent of the code in the question. How complex a version you need for your use case is something you'll have to decide.
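One way to check the equivalence on a small input is to run the loop from the question and the list-based awk script side by side and compare their output. The sketch below does that with an index that repeats an entry; the file contents and the scratch directory are assumptions chosen for illustration.

```shell
# Sample input with a repeated index entry (3 appears twice).
tmpdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 line4 > "$tmpdir/datafile"
printf '%s\n' 3 1 3 2 > "$tmpdir/indexfile"

# Baseline: the loop from the question, one sed invocation per index.
loop_out=$(while read -r index; do
    sed -n "${index}p" "$tmpdir/datafile"
done < "$tmpdir/indexfile")

# Single pass over each file: idx holds a space-separated list of
# output positions for every requested line number.
awk_out=$(awk '
NR == FNR { ++counter; idx[$0] = idx[$0] " " counter; next }
FNR in idx {
  split(idx[FNR], pos)
  for (p in pos) lines[pos[p]] = $0
}
END { for (i = 1; i <= counter; ++i) print lines[i] }
' "$tmpdir/indexfile" "$tmpdir/datafile")

if [ "$loop_out" = "$awk_out" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi

rm -rf "$tmpdir"
```

One difference to keep in mind: for an index past the end of the data file, the awk script prints an empty line, whereas the sed loop prints nothing.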
