Find matches between two data frames and write the result as a data frame


Question

I have two data frames that have been cleaned and merged into a single CSV file. They look like this:

  **Source                         Master**

 chang chun petrochemical      CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L
 citrix systems  pacific       CNH INDUSTRIAL N.V

From these, I have to take the first Source name, check it against every Master name, find the relevant match, and write the output as another data frame. Only a few rows are shown above, but I am working with about 20k values.

My output must look like this:

 **Source                         Master                         Result**

 chang chun petrochemical      CHANG CHUN GROUP                 CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC         CHANG CHUN GROUP
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L  CHURCH AND DWIGHT CO INC
 citrix systems  pacific       CNH INDUSTRIAL N.V               CITRIX SYSTEMS ASIA PACIFIC P L

I tried the approaches from the linked question "Merging through fuzzy matching of variables in R", but with no luck so far.

Thanks in advance!

When I apply that approach to the large data set, this is what I get.

Code used:

Mast <- pmatch(Names$I_sender_O_Receiver_Customer, Master.Names$MOD, nomatch=NA)

Output:

NA NA  2  3 NA NA NA  6 NA NA  9 NA NA NA 12 NA NA NA 13 14 15 16 NA 18 19 20 21 22 NA 24 NA 26 NA 28 NA NA NA 30 NA NA 33 NA 35 36 37 NA 39 40 NA NA 43 NA 45 46 NA 48 49 50 51 52 53 54 55 56 57 58 NA
 [68] 60 61 62 NA NA NA NA 64 NA 66 67 68 69 70 71 72 73 NA 75 76 77 78 NA 79 80 81 NA 83 84 85 86 87 88

Code:

Mast <- sapply(Names$I_sender_O_Receiver_Customer, function(x) {
   agrep(x, Master.Names$MOD,value=TRUE) })

Output:

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] " CHURCH AND DWIGHT CO INC"

[[4]]
[1] " CITRIX SYSTEMS ASIA PACIFIC P L"

[[5]]
character(0)

Even with a for loop, no result is produced.

Code:

for(i in seq_len(nrow(df$ICIS_Cust_Names)))
  {
    df$reslt[i] <- grep(x = str_split(df$ICIS_Cust_Names[i]," ")[[1]][1], df$Master_Names[i],value=TRUE)
  }
  print(df$reslt)

Code 2: loop over only 100 rows

for (i in 100){
  gr1$x[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  gr2$Y[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

Result:

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Error:

Error in `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, " church  dwight  " : 
  replacement has 3 rows, data has 100

Looking at the results above, the code compares only the values in the same row of the two data frames. What I want is to take the first element of Source, check it against all the elements of Master, and return a match, and likewise for the remaining elements. I would appreciate it if someone could correct my code. Thanks in advance!

Recommended answer

If you want to check Master.Names only against the first word of each entry in Names, this could do the trick:

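# Take the first word of each name, upper-case it, and return the master
# name that contains it. The assignment expects exactly one match per name.
Names$Mast <- NA
for(i in seq_len(nrow(Names))) 
    Names$Mast[i] <- grep(toupper(x = strsplit(Names[i,1]," ")[[1]][1]), Master.Names$V1,value=TRUE)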

Edit

Using sapply instead of a loop could gain you some speed:

Names$Mast <- sapply(Names$V1, function(x) {
    grep(toupper(x = strsplit(x," ")[[1]][1]), Master.Names$V1,value=TRUE)
})

Result

> Names
                        V1                            Mast
1 chang chun petrochemical                CHANG CHUN GROUP
2      chang chun plastics                CHANG CHUN GROUP
3            church dwight        CHURCH AND DWIGHT CO INC
4   citrix systems pacific CITRIX SYSTEMS ASIA PACIFIC P L
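
Both snippets above rely on grep returning exactly one master name for every source name. With around 20k names that may not hold: some first words could match nothing, and others could match several master names. The following is a minimal sketch of one way to guard against that, not part of the original answer. The helper first_word_match, the extra "zeta corp" row, and the choice to keep only the first hit are assumptions made for illustration.

```r
# Toy data based on the question; "zeta corp" is an added row with no match.
source_names <- c("chang chun petrochemical", "chang chun plastics",
                  "church  dwight", "citrix systems  pacific", "zeta corp")
master_names <- c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
                  "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V")

# Hypothetical helper: return the first master name that contains the
# first word of `x`, or NA when no master name contains it.
first_word_match <- function(x, master) {
  first_word <- toupper(strsplit(trimws(x), " +")[[1]][1])
  hits <- grep(first_word, master, value = TRUE, fixed = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# vapply guarantees one character value per source name, so the result
# is always a plain column rather than a list.
result <- data.frame(
  Source = source_names,
  Result = vapply(source_names, first_word_match, character(1),
                  master = master_names, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Because the result is an ordinary data frame, it can be written out with write.csv(result, "matches.csv", row.names = FALSE), where the file name is only a placeholder.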

Data

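# stringsAsFactors=FALSE keeps the names as character strings,
# which strsplit() requires (it was not the default before R 4.0).
Master.Names <- read.csv(text="CHANG CHUN GROUP
CHURCH AND DWIGHT CO INC
CITRIX SYSTEMS ASIA PACIFIC P L
CNH INDUSTRIAL N.V", header=FALSE, stringsAsFactors=FALSE)

Names <- read.csv(text="chang chun petrochemical
chang chun plastics
church dwight
citrix systems pacific", header=FALSE, stringsAsFactors=FALSE)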
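
Matching on the first word alone can pick the wrong master name when several companies start with the same word. As a rough alternative, and a different technique from the grep approach in the answer, each source name can be scored by how many whole words it shares with each master name, keeping the master name with the highest score. This is only a sketch under stated assumptions: the helpers tokens and best_overlap are hypothetical, ties go to the first master name, and a source name that shares no word gets NA.

```r
source_names <- c("chang chun petrochemical", "chang chun plastics",
                  "church  dwight", "citrix systems  pacific")
master_names <- c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
                  "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V")

# Hypothetical helper: split a name into unique upper-case words,
# ignoring leading, trailing, and repeated spaces.
tokens <- function(x) unique(strsplit(toupper(trimws(x)), " +")[[1]])

# Tokenize the master names once so the work is not repeated per source name.
master_tokens <- lapply(master_names, tokens)

# Hypothetical helper: master name sharing the most words with `x`,
# or NA when no word is shared. which.max keeps the first of any ties.
best_overlap <- function(x) {
  x_tokens <- tokens(x)
  shared <- vapply(master_tokens,
                   function(m) length(intersect(x_tokens, m)),
                   integer(1))
  if (max(shared) == 0) NA_character_ else master_names[which.max(shared)]
}

result <- data.frame(
  Source = source_names,
  Result = vapply(source_names, best_overlap, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

This compares every source name with every master name, so the run time grows with the product of the two list sizes. Word overlap also ignores spelling variants, which is where an edit-distance function such as agrep or adist would be the more natural fit.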
