如何在正则表达式上联接表 [英] How to join tables on regex

查看:166
本文介绍了如何在正则表达式上联接表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我有两个用于消息的表msg和用于移动网络代码的mnc. 他们没有任何关系.但是我想加入他们

Say I have two tables msg for messages and mnc for mobile network codes. They share no relations. But I want to join them

SELECT msg.message,
    msg.src_addr,
    msg.dst_addr,
    mnc.name,
FROM "msg"
JOIN "mnc"
ON array_to_string(regexp_matches(msg.src_addr || '+' || msg.dst_addr, '38(...)'), '') = mnc.code

但是查询失败,并显示错误:

But query fails with error:

psql:marketing.sql:28: ERROR:  argument of JOIN/ON must not return a set
LINE 12: ON array_to_string(regexp_matches(msg.src_addr || '+' || msg...

有没有办法进行这种加入?还是我走错路了?

Is there a way to do such join? Or am I moving wrong way?

推荐答案

正如@Milen已经提到的,regexp_matches()可能是您所用的错误函数.您需要一个简单的正则表达式匹配(~) .实际上, LIKE运算符(~~)将是更快:

As @Milen already mentioned regexp_matches() is probably the wrong function for your purpose. You want a simple regular expression match (~). Actually, the LIKE operator (~~) will be faster:

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr ~~ ('%38' || mnc.code || '%')
           OR msg.dst_addr ~~ ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3

此外,您只希望mnc.code个正好3个字符.

In addition, you only want mnc.code of exactly 3 characters.

可以用正则表达式编写相同的内容,但是绝对会更慢.这是一个接近原始示例的工作示例:

You could write the same with regular expressions but it will most definitely be slower. Here is a working example close to your original:

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON (msg.src_addr || '+' || msg.dst_addr) ~ (38 || mnc.code)
           AND length(mnc.code) = 3

这还要求msg.src_addrmsg.dst_addrNOT NULL.

第二个查询演示了附加支票length(mnc.code) = 3如何进入JOIN条件或WHERE子句.效果一样.

The second query demonstrates how the additional check length(mnc.code) = 3 can go into the JOIN condition or a WHERE clause. Same effect here.

可以使用 regexp_matches() :

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON EXISTS (
    SELECT * 
    FROM   regexp_matches(msg.src_addr ||'+'|| msg.dst_addr, '38(...)', 'g') x(y)
    WHERE  y[1] = mnc.code
    )

但是比较起来会很慢-大概是这样.

But it will be slow in comparison - or so I assume.

说明:
您的regexp_matches()表达式仅返回 first 匹配项的所有已捕获子字符串的数组.由于仅捕获一个子字符串(模式中的一对方括号),因此您将独家获得具有一个元素的数组.

Explanation:
Your regexp_matches() expression just returns an array of all captured substrings of the first match. As you only capture one substring (one pair of brackets in your pattern), you will exclusively get arrays with one element.

通过附加的全局"开关 'g' ,您可以所有匹配项-但要多行显示.因此,您需要一个子选择来测试所有(或汇总)它们.将其放在EXISTS-半联接中,您便会找到想要的东西.

You get all matches with the additional "globally" switch 'g' - but in multiple rows. So you need a sub-select to test them all (or aggregate). Put that in an EXISTS - semi-join and you arrive at what you wanted.

也许您可以通过所有这三个指标的性能测试来进行报告? 为此,请使用 EXPLAIN ANALYZE .

Maybe you can report back with a performance test of all three? Use EXPLAIN ANALYZE for that.

这篇关于如何在正则表达式上联接表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆