Fuzzy match row in one column with same row in next column


Question

I would like to find information in one column based on another column. I have some words in one column and complete sentences in another, and I would like to know whether the words appear in those sentences. Sometimes the words are not written identically, so I cannot use the SQL LIKE operator. I think fuzzy matching plus some sort of 'like' function would help, as the data looks like this:

Names                    Sentences
Airplanes Sarl           Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                Cola Inc. is 50% share of PopsiCo which is part of LaLo.

The data has about 2,000 rows, which need some logic to determine whether Airplanes Sarl is indeed in the sentence or not. The same goes for Kidco Ltd., which appears in the sentence as 'Kidco.Ltd.'.

To simplify matters, it does not need to search ALL sentences in the column; it only needs to take the name in a given row, e.g. Kidco Ltd., and search for it in the sentence in the same row of the dataframe.

I have already tried it in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

But I got a lot of Unicode/ASCII errors, so I gave up and would like to try in R. Any suggestions on how to go about this in R? I have seen answers on Stack Overflow that fuzzy match against all sentences in the column, which is different from what I want.

Recommended answer

Maybe try tokenization + phonetic matching:

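library(RecordLinkage)  # provides soundex()
library(quanteda)       # provides the tokenizer
# Sample data: one name and one sentence per row
df <- read.table(header = TRUE, sep = ";", text = "
Names                    ;Sentences
Airplanes Sarl           ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd.               ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd.               ;100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                ;Cola Inc. is 50% share of PopsiCo which is part of LaLo.
Popsi Co.                ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
f <- soundex  # phonetic encoder; swap in another function here if needed
# Note: tokenize() with an ngrams argument is the API of early quanteda releases;
# current releases expose tokens() and tokens_ngrams() instead.
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"
tokens <- lapply(tokens, f)  # encode every token of every sentence
# Row by row: is the name's code among the codes of that row's sentence?
mapply(is.element, f(as.character(df$Names)), tokens)
 # A614  K324  K324  P122  P122 
 # TRUE FALSE  TRUE  TRUE  TRUE 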
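
Since the original attempt was in Python, here is a rough sketch of the same idea (1- and 2-gram tokens plus Soundex, compared row by row) using only the standard library. It is not the answer's code: the helper names `strip_accents`, `soundex`, `ngram_codes` and `name_in_sentence` are made up for illustration, the Soundex is a plain American Soundex written by hand, and tokens are split on whitespace only, which is an assumption rather than what quanteda does. Accents are stripped with `unicodedata` first, which may also sidestep the kind of Unicode/ASCII trouble mentioned in the question.

```python
import unicodedata

# Standard American Soundex digit groups
_CODES = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def soundex(text):
    """Four-character Soundex code; everything except A-Z is ignored."""
    letters = [ch for ch in strip_accents(text).upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def ngram_codes(sentence, max_n=2):
    """Soundex codes of all whitespace-delimited 1..max_n-grams."""
    words = sentence.split()
    codes = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            codes.add(soundex(" ".join(words[i:i + n])))
    codes.discard("")  # tokens without letters, e.g. "100%"
    return codes


def name_in_sentence(name, sentence):
    """True if the name's code matches the code of some n-gram."""
    return soundex(name) in ngram_codes(sentence)


# Same five rows as the R example above
rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
    ("Popsi Co.", "Cola Inc. is 50% share of Popsi Co which is part of LaLo."),
]
matches = [name_in_sentence(name, sentence) for name, sentence in rows]
print(matches)  # [True, False, True, True, True]
```

With a pandas dataframe, the same check could be applied per row as `df.apply(lambda r: name_in_sentence(r['Names'], r['Sentences']), axis=1)`. Keep in mind that Soundex keeps only the first letter and three digits, so it is coarse: "Airplanes" alone already encodes to A614, the same code as "Airplanes Sarl", so false positives are possible on real data.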
