将数据与r中的部分匹配合并 [英] merge data with partial match in r

查看:20
本文介绍了将数据与r中的部分匹配合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据集

datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
 "Kish", "Lissp", "Kcn", "CCCa"),
 number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
#-----------
    name number1
1 regular       1
2  kklmin       8
3   notSo       9
4   Jijoh       2
5    Kish      18
6   Lissp      25
7     Kcn      33
8    CCCa       8

 datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
 "KcN", "CaPN"),
   number2 = c(2, 8, 12,    13, 20, 18,   13))
#-------------
   name number2
1 reGulr       2
2   ntSo       8
3  Jijoh      12
4   sean      13
5  LiSsp      20
6    KcN      18
7   CaPN      13

我想按名称列合并它们,但是允许部分匹配(以避免妨碍合并大型数据集中的拼写错误,甚至检测此类拼写错误),例如

I want to merge them by name column, however with partial match is allowed (to avoid hampering merging spelling errors in large data set and even to detect such spelling errors) and for example

(1) 如果在任意位置连续四个字母(如果字母数小于 4 则全部) - 匹配即可

(1) If consecutive four letters (all if the number of letters are less than 4) at any position - match that is fine

 ABBCD = BBCDK = aBBCD = ramABBBCD = ABB 

(2) 匹配中不区分大小写例如 ABBCD = aBbCd

(2) Case sensitivity is off in the match e.g ABBCD = aBbCd

(3) 新数据集将保留两个名称(来自 datf1 和 datf2 的名称).这样我们就可以检测该字母是否匹配(可以单独一列显示匹配多少个字母)

(3) The new dataset will have both names (names from datf1 and datf2) preserved. So that letter we can detect if the match is perfect (may a separate column with how many letter do match)

这样的合并可能吗?

datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
             "Kish", "Lissp", "Kcn", "CCCa"),
                     number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", 
             "LiSsp", "KcN", "CaPN"),
                     number2 = c(2, 8, 12,  13, 20, 18,   13))


uglyMerge(datf1, datf2)
       name1  name2 number1 number2 matches
1  xxregular   <NA>       1      NA       0
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA> reGulr      NA       2       0
10      <NA>   ntSo      NA       8       0
11      <NA>   sean      NA      13       0
12      <NA>   CaPN      NA      13       0

推荐答案

也许有一个简单的解决方案,但我找不到.
恕我直言,您必须自己实现这种合并.
请在下面找到一个丑陋的例子(有很大的改进空间):

Maybe there is a simple solution but I can't find any.
IMHO you have to implement this kind of merging for your own.
Please find an ugly example below (there is a lot of space for improvements):

uglyMerge <- function(df1, df2) {

    ## lower all strings to allow case-insensitive comparison
    lowerNames1 <- tolower(df1[, 1]);
    lowerNames2 <- tolower(df2[, 1]);

    ## split strings into single characters
    names1 <- strsplit(lowerNames1, "");
    names2 <- strsplit(lowerNames2, "");

    ## create the final dataframe
    mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA, 
                        number1=df1[,2], number2=NA, matches=0,
                        stringsAsFactors=FALSE);

    ## store names of dataframe2 (to remember which strings have no match)
    toMerge <- df2[, 1];

    for (i in seq(along=names1)) {
        for (j in seq(along=names2)) {
            ## set minimal match to 4 or to string length
            minMatch <- min(4, length(names2[[j]]));

            ## find single matches
            matches <- names1[[i]] %in% names2[[j]];

            ## look for consecutive matches
            r <- rle(matches);

            ## any matches found?
            if (any(r$values)) {
                ## find max consecutive match
                possibleMatch <- r$value == TRUE;
                maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1];

                ## store max conscutive match length
                maxMatch <- r$length[maxPos];

                ## to remove FALSE-POSITIVES (e.g. CCC and kcn) find 
                ## largest substring
                start <- sum(r$length[0:(maxPos-1)]) + 1;
                stop <- start + r$length[maxPos] - 1;
                maxSubStr <- substr(lowerNames1[i], start, stop);

                ## all matching criteria fulfilled
                isConsecutiveMatch <- maxMatch >= minMatch &&
                                    grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
                                    nchar(maxSubStr) > 0;

                if (isConsecutiveMatch) {
                    ## merging
                    mergedDf[i, "matches"] <- maxMatch
                    mergedDf[i, "name2"] <- as.character(df2[j, 1]);
                    mergedDf[i, "number2"] <- df2[j, 2];

                    ## don't append this row to mergedDf because already merged
                    toMerge[j] <- NA;

                    ## stop inner for loop here to avoid possible second match
                    break;
                }
            }
        } 
    }

    ## append not matched rows to mergedDf
    toMerge <- which(df2[, 1] == toMerge);
    df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]), 
                    number1=NA, number2=df2[toMerge, 2], matches=0, 
                    stringsAsFactors=FALSE);
    mergedDf <- rbind(mergedDf, df2);

    return (mergedDf);
}

输出:

> uglyMerge(datf1, datf2)
    name1  name2 number1 number2 matches
1  xxregular reGulr       1       2       5
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA>   ntSo      NA       8       0
10      <NA>   sean      NA      13       0
11      <NA>   CaPN      NA      13       0

这篇关于将数据与r中的部分匹配合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆