比较两个不同表的两列中逗号分隔的值 [英] Comparing comma separated values from two columns of two different tables

查看:96
本文介绍了比较两个不同表的两列中逗号分隔的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想比较两列(差异表)的值,该列具有两个不同的Oracle表的逗号分隔值.我想找到与所有值都匹配的行(NAME1 所有值应该与NAME2值匹配).

I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1 all values should match with NAME2 values).

注意:用逗号分隔的值的顺序不同.

示例:

T1:

ID_T1             NAME1
===================================


1      ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3      CAFFEINE, PARACETAMOL PH. EUR.
4      PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

T2:

ID_T2          NAME2
=================================

 4      POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
 5      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
 6      PARACETAMOL PH. EUR.,CAFFEINE
 7      CODEINE PHOSPHATE, PARACETAMOL DC
 8      DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE 
10      DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE

我的结果"应仅在两个表中显示基于所有名称匹配"的匹配行.

MY RESULT should only show the matching row based on ALL NAME Matches in both tables.

    ID_T1    ID_T2    MATCHING NAME
    ==================================
    1            4    POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
    3            6    PARACETAMOL PH. EUR.,CAFFEINE
    4           10    PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

尝试过REGEXP_SUBST,但无法使其正常工作.

Tried with REGEXP_SUBST but wasn't able to make it work.

我使用下面的代码来解析值:

I used the below code to parse the values:

SELECT REGEXP_SUBSTR (NAME1, '[^,]+', 1, ROWNUM)
            FROM T1
           CONNECT BY ROWNUM <= LENGTH (NAME1) - 
           LENGTH (REPLACE (NAME, ',')) + 1

推荐答案

您可以将表转换为第一标准格式,然后比较存储在每一行中的化合物.起点可以是:

You could get the table(s) into first normal form and then compare the compounds that are stored in each row. A starting point could be:

{1}对每行进行标记,然后将标记写入新表中.为每个令牌赋予其原始ID plus 3个字母的前缀,以指示该令牌来自哪个表. {2}按ID对新(规范化")表的行进行分组,并执行LISTAGG().执行自我连接,并找到匹配的令牌组".

{1} Tokenize each row, and write the tokens into a new table. Give each token its original ID plus a 3-letter prefix, indicating which table the token came from. {2} Group the rows of the new ("normalized") table by ID, and perform a LISTAGG(). Perform a self join, and find matching "token groups".

{1}标记化,创建表为select(CTAS)

create table tokens
as 
select
  ltrim(        -- ltrim() and rtrim() remove leading/trailing spaces (blanks)
    rtrim( 
      substr( N.wrapped
      , instr( N.wrapped, ',', 1, T.pos ) + 1
      , ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1 
      ) 
    )
  ) token
, N.id
from (        
  select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id
  union all
  select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2  
) N join (  
  select level as pos   -- (max) possible position of char in an existing token
  from dual 
  connect by level <= (
    select greatest(    -- find the longest string ie max position (query T1 and T2) 
      ( select max( length( name1 ) ) from t1 )
    , ( select max( length( name2 ) ) from t2 )
    ) as pos
    from dual
  )  
) T
  on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1 
;

此SO答案是不使用CONNECT BY进行标记化的灵感.

The inspiration to tokenize without using CONNECT BY came from this SO answer.

TOKENS表的内容如下所示:

The contents of the TOKENS table will look something like this:

SQL> select * from tokens ;
TOKEN                           ID       
ASCORBIC ACID                   T1_1     
SODIUM HYDROGEN CARBONATE       T1_2     
CAFFEINE                        T1_3     
PSEUDOEPHEDRINE HYDROCHLORIDE   T1_4     
PARACETAMOL                     T1_100   
sodium hydroxide                T1_110   
POTASSIUM HYDROGEN CARBONATE    T2_4     
SODIUM HYDROGEN CARBONATE       T2_5     
PARACETAMOL PH. EUR.            T2_6     
CODEINE PHOSPHATE               T2_7     
DEXCHLORPHENIRAMINE MALEATE     T2_8     
DEXCHLORPHENIRAMINE MALEATE     T2_10    
PARACETAMOL                     T2_200 
...

{2} GROUP BY,LISTAGG,自我加入

select
  S1.id id1
, S2.id id2
, S1.tokengroup_T1
, S2.tokengroup_T2
from 
(
  select substr( id, 4, length( id ) - 3 ) id
  , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1
  from tokens
  group by id 
  having substr( id, 1, 3 ) = 'T1_'
) S1 
  join 
(
  select substr( id, 4, length( id ) - 3 ) id
  , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2
  from tokens
  group by id 
  having substr( id, 1, 3 ) = 'T2_'
) S2 
  on S1.tokengroup_T1 = S2.tokengroup_T2
;

-- result
ID1   ID2   TOKENGROUP_T1                                                 TOKENGROUP_T2                                                 
4     10    DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   
110   210   potassium carbonate + sodium hydroxide                        potassium carbonate + sodium hydroxide                        
1     4     ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    
3     6     CAFFEINE + PARACETAMOL PH. EUR.                               CAFFEINE + PARACETAMOL PH. EUR. 

以这种方式进行处理时,您可以使物质按字母(字母)顺序排列,还可以在此处选择所需的定界符"(我们用'+'表示).

When doing things this way, you can get the substances into (alphabetical) order, and you can also pick a "delimiter" that you like (we have used '+') here.

替代

如果这对您没有用,或者您认为这太复杂了,则可以尝试使用TRANSLATE().在这种情况下,建议您从数据集中删除所有空格/空格(在查询中-更改原始数据!),如下所示:

If all that is no use to you, or you think this is too complicated, then you could try using TRANSLATE(). In this case, I'd recommend stripping all spaces/blanks from your dataset (in a query - not altering the original data!) like so:

查询

select 
  id1, id2
, name1, name2
from (
  select 
    id_t1 id1
  , id_t2 id2
  , T1.name1 name1
  , T2.name2 name2
  from T1
    join T2 
      on  translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' )
        = translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )
) ;

结果

  ID1   ID2 NAME1                                                                NAME2                                                        
    2     5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID   SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS        
    3     6 CAFFEINE, PARACETAMOL PH. EUR.                                       PARACETAMOL PH. EUR.,CAFFEINE                                
  100    10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE           DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE   
  110   210 sodium hydroxide, potassium carbonate                                sodium hydroxide, potassium carbonate

注意: 我已将以下行添加到您的示例数据中:

NOTE: I've added the following rows to your sample data:

-- T1
110, 'sodium hydroxide, potassium carbonate'

-- T2
210, 'sodium hydroxide, potassium carbonate' 
211, 'potassium hydroxide, sodium carbonate'

我发现很容易使用TRANSLATE()来给您假阳性",即ID为110、210和211的物质看起来会匹配". (换句话说:我认为这不是这项工作的正确工具.)

I found that it is easy to use TRANSLATE() in a way that gives you "false positives" ie the substances with ids 110, 210, and 211 will appear to "match". (In other words: I don't think this is the right tool for this job.)

此处为DBFIDDLE

(点击链接可查看示例表和查询).

(follow the link to see the sample tables and queries).

这篇关于比较两个不同表的两列中逗号分隔的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆