基于字符串的匹配部分有效地加入/合并 [英] Efficiently joining/merging based on matching part of a string

查看:40
本文介绍了基于字符串的匹配部分有效地加入/合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图根据第一个表中的字符串是否包含在第二个表的长字符串的一部分中来连接两个表.我在 SAS 中使用 PROC SQL,但也可以使用数据步骤而不是 SQL 查询.

I am trying to join two tables based on whether or not a string from the first table is contained in part of a long string in the second table. I am using PROC SQL in SAS, but could also use a data step instead of a SQL query.

此代码在较小的数据集上运行良好,但很快就会陷入困境,因为它必须进行大量比较.如果它是一个简单的相等性检查就好了,但是必须使用 index() 函数使它变得困难.

This code works fine on smaller datasets, but rapidly gets bogged down since it has to make a ton of comparisons. It would be fine if it were a simple equality check, but having to use the index() function makes it tough.

proc sql noprint;
  create table matched as
  select A.*, B.* 
  from  search_notes as B,
        names as A
  where index(B.notes,A.first) or 
        index(B.notes,A.last)
  order by names.name, notes.id;
quit;
run;

B.notes 是一个 2000 个字符(有时完全填充)的文本块,我正在寻找包含来自 A 的名字或姓氏的任何结果.

B.notes is a 2000 character (sometimes fully populated) block of text, and I am looking for any result that contains either the first or last name from A.

我认为分两步做不会有任何速度优势,因为它已经必须将 A 的每一行与 B 的每一行进行比较(因此检查名字和姓氏都不是瓶颈).

I don't think I get any speed advantage from doing it in two steps since it already has to compare every line of A with every line of B (so checking for both the first and last name isn't the bottleneck).

当我运行它时,我得到 注意:此查询的执行涉及执行一个或多个无法优化的笛卡尔积连接. 在我的日志中.使用 A=4000 观测值和 B=100,000 观测值运行它需要 30 分钟才能产生约 1000 个匹配项.

When I run it, I get NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. in my log. Running it with A=4000 observations and B=100,000 observations takes 30 minutes to produce ~1000 matches.

有什么办法可以优化这个吗?

Is there any way to optimize this?

推荐答案

这是一个部分答案,使其运行速度提高了 4-5 倍,但并不理想(对我来说有帮助,但不一定有效在优化笛卡尔积连接的一般情况下).

This is a partial answer that makes it run 4-5X faster, but it isn't ideal (it helps in my case, but wouldn't necessarily work in the general case of optimizing a Cartesian product join).

我最初有 4 个单独的 index() 语句,就像在我的示例中一样(我的简化示例有 2 个用于 A.first 和 A.last).

I originally had 4 separate index() statements like in my example (my simplified sample had 2 for A.first and A.last).

我能够将所有 4 个 index() 语句(加上我要添加的第 5 个)重构为解决相同问题的正则表达式.它不会返回相同的结果集,但我认为它实际上返回比 5 个单独索引更好的结果,因为您可以指定单词边缘.

I was able to refactor all 4 of those index() statements (plus a 5th I was going to add) into a regular expression that solves the same problem. It won't return an identical result set, but I think it actually returns better results than the 5 separate indexes since you can specify word edges.

在清理名称以进行匹配的数据步骤中,我创建了以下模式:

In the datastep where I clean the names for matching, I create the following pattern:

pattern = cats('/\b(',substr(upcase(first_name),1,1),'|',upcase(first_name),').?\s?',upcase(last_name),'\b/');

这应该沿着 /\b(F|FIRST).?\s?LAST\b/ 行创建一个正则表达式,它将匹配 F. Last, First Last, flast@email.com 等(有些组合无法识别,但我只关心我在数据中观察到的组合).使用 '\b' 也不允许 FLAST 恰好与单词的开头/结尾相同(例如Edward Lo"与Eloquent"匹配),我发现使用 index() 很难避免这种情况

This should create a regex along the lines of /\b(F|FIRST).?\s?LAST\b/ which will match anything like F. Last, First Last, flast@email.com, etc (there are combinations that it doesn't pick up, but I was only concerned with combinations that I observe in my data). Using '\b' also doesn't allow things where FLAST happens to be the same as the start/end of a word (such as "Edward Lo" getting matched to "Eloquent") which I find hard to avoid with index()

然后我像这样执行我的 sql 连接:

Then I do my sql join like this:

proc sql noprint;
create table matched as
  select  B.*, 
          prxparse(B.pattern) as prxm, 
          A.* 
  from  search_text as A,
        search_names as B
  where prxmatch(calculated prxm,A.notes)
  order by A.id;
quit;
run;

能够为 B 中的每个名称编译一次正则表达式,然后在 A 中的每段文本上运行它似乎比几个索引语句快得多(不确定正则表达式与单个索引的情况).

Being able to compile the regex once per name in B, and then run it on each piece of text in A seems to be dramatically faster than a couple of index statements (not sure about the case of a regex vs a single index).

使用 A=250,000 Obs 和 B=4,000 Obs 运行它,index() 方法花费了大约 90 分钟的 CPU 时间,而使用 prxmatch() 执行相同的操作只花费了 20 分钟的 CPU 时间.

Running it with A=250,000 Obs and B=4,000 Obs, took something like 90 minutes of CPU time for the index() method, while doing the same with prxmatch() took only 20 minutes of CPU time.

这篇关于基于字符串的匹配部分有效地加入/合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆