Efficient string similarity grouping


Problem Description


Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).

 pdata<-data.frame(parents_name=c("peter pan + marta steward",
                                 "pieter pan + marta steward",
                                 "armin dolgner + jane johanna dough",
                                 "jack jackson + sombody else"))

The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth observations are each in a separate family. E.g:

person_id    parents_name                           family_id
1            "peter pan + marta steward",           1
2            "pieter pan + marta steward",          1
3            "armin dolgner + jane johanna dough",  2
4            "jack jackson + sombody else"          3

Current approach: I am flexible regarding the distance metric. Currently, I use Levenshtein edit-distance to match observations, allowing for two-character differences. But other variants such as "largest common substring" would be fine if they run faster.

For smaller subsamples I use stringdist::stringdist in a loop or stringdist::stringdistmatrix, but this is getting increasingly inefficient as sample size increases.

The matrix version explodes once a certain sample size is used (a full pairwise matrix for 4 million strings would have 1.6e13 entries). My terribly inefficient attempt at looping is here:

#create data of the same complexity using random last-names
#(4mio obs and ~1-3 kids per parents) 
pdata<-data.frame(parents_name=paste0(rep(c("peter pan + marta ",
                                "pieter pan + marta ",
                                "armin dolgner + jane johanna ",
                                "jack jackson + sombody "),1e6),stringi::stri_rand_strings(4e6, 5)))

for (i in 1:nrow(pdata)) {
  similar_fathersname0 <- stringdist::stringdist(pdata$parents_name[i], pdata$parents_name[i:nrow(pdata)], nthread = 4) < 2
  #[create grouping indicator]
}

My question: There should be substantial efficiency gains, e.g. because I could stop comparing strings once I found them to be sufficiently different in something that is easier to assess, e.g. string length, or the first word. The string length variant already works and reduces complexity by a factor of ~3. But that's far too little. Any suggestions to reduce computation time are appreciated.

Remarks:

  • The strings are actually in Unicode and not in the Latin alphabet (Devanagari)
  • Pre-processing to drop unused characters etc. is already done

Solution

There are two challenges:

A. The parallel execution of Levenshtein distance measurements, instead of a sequential loop

B. The number of comparisons: if our source list has 4 million entries, theoretically we would have to run 16 trillion Levenshtein distance measurements, which is unrealistic even if we resolve the first challenge.

To make my use of language clear, here are our definitions:

  • we want to measure the Levenshtein distance between expressions.
  • every expression has two sections, the parent A full name and the parent B full name, which are separated by a plus sign
  • the order of the sections matters (i.e. two expressions (1, 2) are identical if Parent A of expression 1 = Parent A of expression 2 and Parent B of expression 1 = Parent B of expression 2. Expressions will not be considered identical if Parent A of expression 1 = Parent B of expression 2 and Parent B of expression 1 = Parent A of expression 2)
  • a section (or a full name) is a series of words, which are separated by spaces or dashes and correspond to the first name and last name of a person
  • we assume the maximum number of words in a section is 6 (your example has sections of 2 or 3 words; I assume we can have up to 6)
  • the sequence of words in a section matters (the section is always a first name followed by a last name and never the last name first, e.g. Jack John and John Jack are two different persons)
  • there are 4 million expressions
  • expressions are assumed to contain only English characters. Numbers, spaces, punctuation, dashes, and any non-English character can be ignored
  • we assume the easy matches are already done (like the exact expression matches) and we do not have to search for exact matches

Technically, the goal is to find series of matching expressions in the 4-million-expression list. Two expressions are considered matching if their Levenshtein distance is less than 2.

Practically we create two lists, which are exact copies of the initial 4-million expression list. We call them the Left list and the Right list. Each expression is assigned an expression id before duplicating the list. Our goal is to find entries in the Right list which have a Levenshtein distance of less than 2 to entries of the Left list, excluding the same entry (same expression id).

I suggest a two-step approach to resolve the two challenges separately. The first step will reduce the list of possible matching expressions, the second will simplify the Levenshtein distance measurement since we only look at very close expressions. The technology used is any traditional database server, because we need to index the data sets for performance.

CHALLENGE A

Challenge A consists of reducing the number of distance measurements. We start from a maximum of approximately 16 trillion (4 million squared) and we should not exceed a few tens or hundreds of millions. The technique to use here consists of searching for at least one similar word in the complete expression. Depending on how the data is distributed, this will dramatically reduce the number of possible matching pairs. Alternatively, depending on the required accuracy of the result, we can also search for pairs with at least two similar words, or with at least half of the words similar.

Technically, I suggest putting the expression list in a table. Add an identity column to create a unique id per expression, and create 12 text columns (one for each possible word). Then parse the expressions and put each word of each section in a separate column. This will look like the following (not all 12 columns are shown, but the idea is below):

| id | expression                | sect_a_w_1 | sect_a_w_2 | sect_b_w_1 | sect_b_w_2 |
| 1  | peter pan + marta steward | peter      | pan        | marta      | steward    |

There are empty columns (since there are very few expressions with 12 words) but it does not matter.
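
For concreteness, here is a minimal sketch of such a table for the Left copy; the name left_table and the column types are assumptions (not prescribed above), and IDENTITY is SQL Server syntax (use SERIAL or AUTO_INCREMENT on other engines). Parsing the words into the columns can be done by whatever tool loads the data.

CREATE TABLE left_table (
    id         INT IDENTITY(1,1) PRIMARY KEY,   -- unique id per expression
    expression NVARCHAR(400),
    sect_a_w_1 NVARCHAR(50), sect_a_w_2 NVARCHAR(50), sect_a_w_3 NVARCHAR(50),
    sect_a_w_4 NVARCHAR(50), sect_a_w_5 NVARCHAR(50), sect_a_w_6 NVARCHAR(50),
    sect_b_w_1 NVARCHAR(50), sect_b_w_2 NVARCHAR(50), sect_b_w_3 NVARCHAR(50),
    sect_b_w_4 NVARCHAR(50), sect_b_w_5 NVARCHAR(50), sect_b_w_6 NVARCHAR(50)
);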

Then we replicate the table and create an index on every sect... column. We run 12 joins which try to find similar words, something like

SELECT L.id, R.id
FROM left_table L JOIN right_table R
ON L.sect_a_w_1 = R.sect_a_w_1
AND L.id <> R.id

We collect the output in 12 temp tables and run a union query over the 12 tables to get a short list of all expressions which have a potential matching expression with at least one identical word. This is the solution to our challenge A. We now have a short list of the most likely matching pairs. This list will contain millions of records (pairs of Left and Right entries), but not billions.
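
As a sketch, assuming each of the 12 joins was materialised into a temp table named match_01 through match_12 with columns left_id and right_id (names invented here for illustration), the union step could look like this; UNION already removes duplicate pairs:

SELECT left_id, right_id FROM match_01
UNION
SELECT left_id, right_id FROM match_02
-- ... one SELECT per word column, 12 in total ...
UNION
SELECT left_id, right_id FROM match_12;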

CHALLENGE B

The goal of challenge B is to process a simplified Levenshtein distance in batch (instead of running it in a loop). First we should agree on what a simplified Levenshtein distance is. We agree that the Levenshtein distance of two expressions is the sum of the Levenshtein distances of all the words of the two expressions which have the same index; I mean the Levenshtein distance of two expressions is the distance of their two first words, plus the distance of their two second words, etc. Secondly, we need to invent a simplified Levenshtein distance. I suggest using the n-gram approach with only grams of 2 characters which have an index absolute difference of less than 2.

e.g. the distance between peter and pieter is calculated as below

Peter       
1 = pe          
2 = et          
3 = te          
4 = er
5 = r_           

Pieter
1 = pi
2 = ie
3 = et
4 = te
5 = er
6 = r_ 

Peter and Pieter have 4 common 2-grams with an index absolute difference of less than 2: 'et', 'te', 'er', 'r_'. There are 6 possible 2-grams in the larger of the two words, so the distance is 6 - 4 = 2. The Levenshtein distance would also be 2, because there is one move of 'eter' and one letter insertion 'i'.

This is an approximation which will not work in all cases, but I think in our situation it will work very well. If we're not satisfied with the quality of the results, we can try with 3-grams or 4-grams, or allow a gram sequence difference larger than 2. But the idea is to execute far fewer calculations per pair than in the traditional Levenshtein algorithm.

Then we need to convert this into a technical solution. What I have done before is the following. First, isolate the words: since we only need to measure the distance between words and then sum these distances per expression, we can further reduce the number of calculations by running a distinct select on the list of words (we have already prepared the word columns in the previous section).
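
A sketch of that distinct select, assuming the parsed table from challenge A is called left_table and the word list table left_word_table has an identity word_id column (names are illustrative only):

INSERT INTO left_word_table (word)
SELECT word
FROM (
    SELECT sect_a_w_1 AS word FROM left_table
    UNION SELECT sect_a_w_2 FROM left_table
    -- ... one SELECT per sect... column, 12 in total ...
    UNION SELECT sect_b_w_6 FROM left_table
) AS all_words
WHERE word IS NOT NULL;   -- UNION already removes duplicate words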

This approach requires a mapping table which keeps track of the expression id, the section id, the word id and the word sequence number for each word, so that the original expression distance can be calculated at the end of the process.
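
A possible shape for that mapping table (one row per word occurrence; the names are assumptions for illustration):

CREATE TABLE word_map (
    expression_id INT,       -- id of the expression the word occurs in
    section_id    TINYINT,   -- 1 = parent A section, 2 = parent B section
    word_seq      TINYINT,   -- position of the word within its section
    word_id       INT        -- id of the word in the word list table
);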

We then have a new list which is much shorter, and contains a cross join of all words for which the 2-gram distance measure is relevant. Then we want to batch process this 2-gram distance measurement, and I suggest doing it in a SQL join. This requires a pre-processing step which consists of creating a new temporary table which stores every 2-gram in a separate row, and keeps track of the word id, the word sequence and the section type.
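
The gram table itself could be as simple as the sketch below; the column names follow the INSERT statements further down, a section-type column can be added as just described, and IDENTITY is again SQL Server syntax:

CREATE TABLE left_gram_table (
    pk       INT IDENTITY(1,1) PRIMARY KEY,
    word_id  INT,       -- id of the word being sliced
    gram_seq TINYINT,   -- position of the 2-gram within the word
    gram     NVARCHAR(2)
);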

Technically this is done by slicing the list of words using a series (or a loop) of substring selects, like this (assuming the word list tables - there are two copies, one Left and one Right - contain 2 columns, word_id and word):

INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 1 AS gram_seq, SUBSTRING(word,1,2) AS gram
FROM left_word_table 

And then

INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 2 AS gram_seq, SUBSTRING(word,2,2) AS gram
FROM left_word_table 

Etc.

This will make "steward" look like the following (assume the word id is 152):

| pk | word_id | gram_seq | gram |
| 1  | 152     | 1        | st   |
| 2  | 152     | 2        | te   |
| 3  | 152     | 3        | ew   |
| 4  | 152     | 4        | wa   |
| 5  | 152     | 5        | ar   |
| 6  | 152     | 6        | rd   |
| 7  | 152     | 7        | d_   |
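
As an aside, the per-position INSERT statements above can be collapsed into one set-based statement. This sketch assumes an auxiliary table numbers(n) holding the integers 1 to 50, and appends a trailing '_' so the last gram comes out as 'd_' or 'r_'; LEN and the + string concatenation are SQL Server syntax:

INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT w.word_id,
       n.n AS gram_seq,
       SUBSTRING(w.word + '_', n.n, 2) AS gram   -- one 2-gram per character position
FROM left_word_table w
JOIN numbers n ON n.n <= LEN(w.word);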

Don't forget to create indexes on the word_id, gram and gram_seq columns. The distance can then be calculated with a join of the left and right gram lists, where the ON clause looks like

ON L.gram = R.gram
AND ABS(L.gram_seq - R.gram_seq) < 2
AND L.word_id <> R.word_id

The distance is the length of the longer of the two words minus the number of matching grams. SQL is extremely fast at this kind of query, and I think a simple computer with 8 GB of RAM would easily handle several hundred million rows in a reasonable time frame.
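
Putting this together, a hedged sketch of the word-to-word distance query follows. It assumes the word tables carry a precomputed word_len column and that the result is stored in a table called word_pair_distance; neither is prescribed above, and in practice the join should also be restricted to the candidate pairs found in challenge A. Word pairs with no common gram simply produce no row and can be treated as too far apart.

INSERT INTO word_pair_distance (left_word_id, right_word_id, word_distance)
SELECT L.word_id,
       R.word_id,
       CASE WHEN LW.word_len > RW.word_len THEN LW.word_len ELSE RW.word_len END
         - COUNT(*) AS word_distance             -- longest word length minus matching grams
FROM left_gram_table L
JOIN right_gram_table R
  ON L.gram = R.gram
 AND ABS(L.gram_seq - R.gram_seq) < 2
 AND L.word_id <> R.word_id
JOIN left_word_table  LW ON LW.word_id = L.word_id
JOIN right_word_table RW ON RW.word_id = R.word_id
GROUP BY L.word_id, R.word_id, LW.word_len, RW.word_len;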

And then it's only a matter of joining the mapping table to calculate the sum of the word-to-word distances in every expression pair, to get the total expression-to-expression distance.
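
Finally, a sketch of that last join, reusing the word_map and word_pair_distance tables assumed above; only words in the same section and at the same position are compared, which matches the definition of the expression distance given earlier:

SELECT LM.expression_id AS left_expr_id,
       RM.expression_id AS right_expr_id,
       SUM(D.word_distance) AS expression_distance
FROM word_pair_distance D
JOIN word_map LM ON LM.word_id = D.left_word_id
JOIN word_map RM ON RM.word_id = D.right_word_id
                AND RM.section_id = LM.section_id   -- section order matters
                AND RM.word_seq   = LM.word_seq     -- compare words at the same position only
WHERE LM.expression_id <> RM.expression_id
GROUP BY LM.expression_id, RM.expression_id
HAVING SUM(D.word_distance) < 2;                    -- keep only pairs that count as a match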
