Is there any command to do fuzzy matching in Linux based on multiple columns

Problem description

I have two CSV files. File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:

PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018

What I want to do is use the crosswalk file, File 2, to back out those observations' PIDs in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to back out as many of their PIDs as possible (the level of accuracy should, of course, be taken into account). For example, the observations in File 1 with FNAME "Paul" and LNAME "Row" should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".

Since I want to recover as many of File 1's PIDs as possible while keeping accuracy high, I have three steps in mind. First, use a command to make sure that an observation in File 1 is assigned a PID if and only if the information in FNAME, MNAME, LNAME, GENDER, and DOB matches completely. The output should be

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
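
For reference, this exact-match step can presumably be done with plain awk by joining on a composite key built from the five columns. The sketch below is illustrative rather than a tested solution; exact.awk is just a placeholder name, and it would be run as awk -f exact.awk file2 file1:

BEGIN { FS=OFS="," }
NR==FNR {                                 # first input: file2
    pid[$2 FS $3 FS $4 FS $5 FS $6] = $1  # key FNAME,MNAME,LNAME,GENDER,DOB -> PID
    next
}
{                                         # second input: file1
    key = $2 FS $3 FS $4 FS $5 FS $6
    print $0, (key in pid ? pid[key] : "")  # append the PID only on a complete match
}

Records with no complete match get an empty PID field, which keeps File 1's order and row count intact.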

Next, write another command that, for the observations not identified in the first step, requires the DOB information to be exactly the same but uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to back out their PIDs. The output after these two steps is supposed to be

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

In the final step, use a new command that does fuzzy matching on all of the related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PIDs. The final output is expected to be

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

I need to keep the order of File 1's observations, so it has to be a kind of left outer join. Because my original data is about 100 GB, I want to deal with this in Linux, but I have no idea how to complete the last two steps with awk or any other command. Could anyone help me out? Thank you.

Recommended answer

Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to each value, e.g. field[2]["66M"]="S2"; then, for each record in file1, it counts the number of PID matches and prints the PID with the highest count:

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR {                                                      # file2
    for(i=1;i<=6;i++)                                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                                               # file1
        for(i=1;i<=6;i++) {                                     # fields 1-6
            if($i in field[i]) {                                # if value matches
                split(field[i][$i],t,FS)                        # get PIDs
                for(j in t) {                                   # and
                    matches[t[j]]++                             # increase PID counts
                }
            } else {                                            # if no value match
                for(j in field[i])                              # for all field values
                    if($i~j || j~$i)                            # "go fuzzy" :D
                        matches[field[i][j]]+=0.5               # fuzzy is half a match
            }
        }
        for(i in matches) {                                     # the best match first
            print $0,i
            delete matches
            break                                               # we only want the best match
        }
}
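
Assuming the script is saved as, say, fuzzy.awk (an illustrative name), it would be run with file2 given first so that its values are hashed before file1 is streamed through, for example:

gawk -f fuzzy.awk file2 file1 > file1_with_pid

Note that GNU awk (gawk) is required: the script relies on PROCINFO["sorted_in"] and on arrays of arrays (field[i][$i]), neither of which plain POSIX awk provides.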

Output:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

The "fuzzy match" here is the naive if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around on the internet. Rosetta seems to have one.
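
As an illustration (not from the original answer), a self-contained Levenshtein distance function in awk could look like the sketch below; the naive test could then be replaced by something like if(levenshtein($i,j)<=1), with the threshold tuned to your data:

function min3(x, y, z,    m) {
    m = x
    if (y < m) m = y
    if (z < m) m = z
    return m
}
function levenshtein(a, b,    la, lb, i, j, c, d) {   # classic dynamic-programming edit distance
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i             # i deletions
    for (j = 0; j <= lb; j++) d[0, j] = j             # j insertions
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            c = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1              # substitution cost
            d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + c)
        }
    return d[la, lb]
}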

You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.

Update: a version that maps file1 fields to file2 fields (as mentioned in the comments):

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1                                                   # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                                      # file2
    for(i=1;i<=6;i++)                                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                                              # file1
    for(i in map) {
        if($i in field[map[i]]) {                              # if value matches
            split(field[map[i]][$i],t,FS)                      # get PIDs
            for(j in t) {                                      # and
                matches[t[j]]++                                # increase PID counts
            }
        } else {                                               # if no value match
            for(j in field[map[i]])                            # for all field values
                if($i~j || j~$i)                               # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5             # fuzzy is half a match
        }
    }
    for(i in matches) {                                        # the best match first
        print $0,i
        delete matches
        break                                                  # we only want the best match
    }
}
