Compare consecutive rows in awk (or Python) and randomly select one of the duplicate lines

Views: 180 Published: 2016/7/29 11:07:15 Tags: python-2.7 awk

Problem description

I would like to compare consecutive rows in a big file (~1GB) using awk or Python (since the files are big, I would prefer awk). Here is an example of input and output:

Input file

#x   y
1    11        # Remarks (not part of the input file)
10   12        # (Remark *1)
10   17        #
4    14
20   15        # (Remark *2)
20   16        #
20   17        #
20   22        #
5    19
10   20

(Remark *1): since the x-value of this row and the x-value of the next row are the same, this line or the next line (RANDOM selection) should be printed in the output file

(Remark *2): since the x-value of this row and the x-value of the next 3 lines are the same, this line or ONE of the next 3 lines (RANDOM selection) should be printed in the output file

The output file I want looks like this:

#x   y
1    11
10   17
4    14
20   17
5    19
10   20

or (since the selection is random whenever the same x-value appears in consecutive rows)

#x   y
1    11
10   12
4    14
20   16
5    19
10   20

Basically I want to compare if the x-value of the current line/row is the same as the x-value of the next consecutive lines/rows. If not, the current line should be printed. If yes, only one random line should be selected of the consecutive lines/rows with the same x-values (the y-values are not important for comparison).
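
To make the rule concrete, here is a minimal Python sketch (Python being the alternative mentioned in the question) that applies it to the sample rows. It is only an illustration under stated assumptions: the function name `pick_one_per_run` and the in-memory `sample` list are made up for this example, every line is assumed to be non-empty, and x-values are compared as strings.

```python
import random
from itertools import groupby


def pick_one_per_run(lines, rng=random):
    """Return one randomly chosen line from each run of consecutive lines
    that share the same first whitespace-separated field (the x-value).

    Assumption: no line is empty, so line.split()[0] always exists.
    """
    chosen = []
    # groupby merges only *adjacent* items with equal keys, so the two
    # separate runs with x == 10 in the sample are treated independently.
    for _x, run in groupby(lines, key=lambda line: line.split()[0]):
        chosen.append(rng.choice(list(run)))
    return chosen


# Sample rows from the question, without the header and the remarks.
sample = [
    "1 11", "10 12", "10 17", "4 14",
    "20 15", "20 16", "20 17", "20 22",
    "5 19", "10 20",
]

if __name__ == "__main__":
    for line in pick_one_per_run(sample):
        print(line)
```

Each call prints six lines whose x-values are always 1, 10, 4, 20, 5, 10 in that order; only the y-values picked for the runs with x = 10 and x = 20 vary between calls.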

I hope, somebody can help me!

Solution

$ cat tst.awk
function prtBuf(        idx) {
    if (cnt > 0) {
        idx = int((rand() * cnt) + 1)
        print buf[idx]
    }
    cnt = 0
}

BEGIN { srand() }
$1 != prev { prtBuf() }
{ buf[++cnt]=$0; prev=$1 }
END { prtBuf() }

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   17        #
4    14
20   17        #
5    19
10   20

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   12        # (Remark *1)
4    14
20   22        #
5    19
10   20

I assumed the x and y column headers from your example weren't actually part of your input file and so removed them. If they do exist and you want them in the output then just add a NR==1{print;next} line up front.
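
For readers who prefer the Python route mentioned in the question, the following is a rough counterpart to the awk script, not part of the original answer. It is a sketch under assumptions: the names `sample_runs` and `keep_header` are hypothetical, x-values are compared as strings, and instead of buffering a whole run it keeps a single candidate line per run (reservoir sampling with a reservoir of one), which is a swapped-in technique and not what the awk code does. The `keep_header` flag plays the role of the NR==1 rule described above.

```python
import random


def sample_runs(lines, keep_header=False, rng=random):
    """Yield one uniformly random line per run of equal first fields.

    Only the current candidate is held in memory, so a very long run in
    a large file does not have to be buffered.
    """
    lines = iter(lines)
    if keep_header:
        # Pass the first line through untouched, like NR==1{print;next}.
        header = next(lines, None)
        if header is not None:
            yield header
    prev_key = None  # x-value of the current run
    chosen = None    # candidate line for the current run
    count = 0        # number of lines seen so far in the current run
    for line in lines:
        fields = line.split()
        key = fields[0] if fields else ""  # blank line -> empty key
        if count and key != prev_key:
            yield chosen  # the previous run has ended
            count = 0
        count += 1
        # The n-th line of a run replaces the candidate with
        # probability 1/n, which makes every line equally likely.
        if rng.randrange(count) == 0:
            chosen = line
        prev_key = key
    if count:
        yield chosen  # flush the last run, like the END block


if __name__ == "__main__":
    demo = ["#x y", "1 11", "10 12", "10 17", "4 14", "10 20"]
    for picked in sample_runs(demo, keep_header=True):
        print(picked)
```

Because the function accepts any iterable of lines, an open file object can be passed directly and the results written out as they are produced, so the file is read in a single pass.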
