Awk or Gawk to do data matching and merging


Question

Related question: http://stackoverflow.com/posts/18164848

The input file input.txt is a tab-delimited Unicode text file containing:

a  A   e  f  m
b  B   g  h
c  C   i  j
b  B   k  l

I want to match rows on the first and second columns and merge them, so that output.txt contains:

a  A   e  f  m
b  B   g  h     k  l
c  C   i  j

The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" is placed starting at the 6th column.

I nearly managed to do this in Matlab when all the values were numbers. With letters, however, Matlab handled Unicode so poorly that I gave up, even after reading Stack Overflow posts about dealing with Unicode in Matlab. So I have now turned to Python.

Nirk, at http://stackoverflow.com/posts/18164848, responded that the following line will do:

awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}'

However, this code doesn't seem to specify the input and output files.

Recommended answer

awk is a pipe-based Linux command: it reads from standard input and writes to standard output. To feed it the input file and capture the output, use shell redirection:

awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}' < input.txt > output.txt

However, the awk program above does not meet your requirement that "the code has to detect the maximum number of columns in the input. Since it is 5 in this example, 'k l' were put from 6th column." It appends the remaining fields of each matching row without any padding.
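The gap shows up on the two "b B" rows from the question. The following is a minimal Python 3 sketch, not part of the original answer; the helper name `pad` is hypothetical. It contrasts plain concatenation of the value fields with concatenation after padding each row to the widest value count in the sample (3, from the "a A" row).

```python
# Value fields (columns 3 onward) of the two "b B" rows in the question.
rows = [["g", "h"], ["k", "l"]]
width = 3  # widest value count in the sample input: "e f m" on the "a A" row


def pad(values, width):
    """Return values extended with empty strings up to `width` entries."""
    return values + [""] * (width - len(values))


# Plain concatenation: "k" lands in column 5 (key columns 1-2, then g, h).
unpadded = ["b", "B"] + [v for row in rows for v in row]

# Padding each row first: "k" lands in column 6, as the question asks.
padded = ["b", "B"] + [v for row in rows for v in pad(row, width)]

print(unpadded.index("k") + 1)  # prints 5
print(padded.index("k") + 1)    # prints 6
```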

You can try this Python program:

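max_value_fields = 0
values = dict()

# Assumes the input is UTF-8; pass another encoding (e.g. "utf-16") if needed.
with open("input.txt", encoding="utf-8") as f:
    keys = []  # distinct keys, in order of first appearance
    for line in f:
        line = line.rstrip('\r\n')  # drop only the line ending, keep empty fields
        if not line:
            continue  # skip blank lines
        fs = line.split('\t')

        # The first two columns form the matching key.
        key = '%s\t%s' % (fs[0], fs[1])
        if key not in values:
            values[key] = list()
            keys.append(key)
        values[key].append(fs[2:])

        # Track the largest number of value columns seen in any row.
        value_fields = len(fs) - 2
        if value_fields > max_value_fields:
            max_value_fields = value_fields

with open("output.txt", 'w', encoding="utf-8") as f:
    for key in keys:
        fields = [key]
        for value_list in values[key]:
            fields.extend(value_list)
            # Pad to the widest row so merged values start at a fixed column.
            fields.extend([''] * (max_value_fields - len(value_list)))
        print('\t'.join(fields), file=f)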
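To check the approach against the sample data without creating any files, the same steps can be wrapped in a function. This is a Python 3 sketch that assumes rows arrive as tab-separated strings; the name `merge_rows` is hypothetical and does not come from the original answer.

```python
def merge_rows(lines):
    """Merge tab-separated rows that share their first two columns.

    Each row's value fields are padded to the widest value count seen in
    the input, so merged values always start at a fixed column.
    """
    groups = {}  # key -> list of value lists; dicts keep insertion order
    width = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1])
        groups.setdefault(key, []).append(fields[2:])
        width = max(width, len(fields) - 2)

    merged = []
    for key, value_lists in groups.items():
        out = list(key)
        for values in value_lists:
            out.extend(values + [""] * (width - len(values)))
        merged.append("\t".join(out))
    return merged


# The four rows from the question, written with explicit tabs.
sample = ["a\tA\te\tf\tm", "b\tB\tg\th", "c\tC\ti\tj", "b\tB\tk\tl"]
for row in merge_rows(sample):
    print(row.split("\t"))
```

On the sample rows, the merged "b B" line carries "k" in its 6th field. Rows shorter than the widest one end with empty padding fields; calling `rstrip("\t")` on each output line would remove them if they are unwanted.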
