比较两个不同表的两列中逗号分隔的值 [英] Comparing comma separated values from two columns of two different tables
问题描述
我想比较两列(差异表)的值,该列具有两个不同的Oracle表的逗号分隔值.我想找到与所有值都匹配的行(NAME1
所有值应该与NAME2
值匹配).
I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1
all values should match with NAME2
values).
注意:用逗号分隔的值的顺序不同.
示例:
T1:
ID_T1 NAME1
===================================
1 ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3 CAFFEINE, PARACETAMOL PH. EUR.
4 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE
T2:
ID_T2 NAME2
=================================
4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
6 PARACETAMOL PH. EUR.,CAFFEINE
7 CODEINE PHOSPHATE, PARACETAMOL DC
8 DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE
10 DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE
我的结果"应仅在两个表中显示基于所有名称匹配"的匹配行.
MY RESULT should only show the matching row based on ALL NAME Matches in both tables.
ID_T1 ID_T2 MATCHING NAME
==================================
1 4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
3 6 PARACETAMOL PH. EUR.,CAFFEINE
4 10 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE
尝试过REGEXP_SUBST
,但无法使其正常工作.
Tried with REGEXP_SUBST
but wasn't able to make it work.
我使用下面的代码来解析值:
I used the below code to parse the values:
SELECT REGEXP_SUBSTR (NAME1, '[^,]+', 1, ROWNUM)
FROM T1
CONNECT BY ROWNUM <= LENGTH (NAME1) -
LENGTH (REPLACE (NAME, ',')) + 1
推荐答案
您可以将表转换为第一标准格式,然后比较存储在每一行中的化合物.起点可以是:
You could get the table(s) into first normal form and then compare the compounds that are stored in each row. A starting point could be:
{1}对每行进行标记,然后将标记写入新表中.为每个令牌赋予其原始ID plus 3个字母的前缀,以指示该令牌来自哪个表. {2}按ID对新(规范化")表的行进行分组,并执行LISTAGG().执行自我连接,并找到匹配的令牌组".
{1} Tokenize each row, and write the tokens into a new table. Give each token its original ID plus a 3-letter prefix, indicating which table the token came from. {2} Group the rows of the new ("normalized") table by ID, and perform a LISTAGG(). Perform a self join, and find matching "token groups".
{1}标记化,创建表为select(CTAS)
create table tokens
as
select
ltrim( -- ltrim() and rtrim() remove leading/trailing spaces (blanks)
rtrim(
substr( N.wrapped
, instr( N.wrapped, ',', 1, T.pos ) + 1
, ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1
)
)
) token
, N.id
from (
select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id
union all
select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2
) N join (
select level as pos -- (max) possible position of char in an existing token
from dual
connect by level <= (
select greatest( -- find the longest string ie max position (query T1 and T2)
( select max( length( name1 ) ) from t1 )
, ( select max( length( name2 ) ) from t2 )
) as pos
from dual
)
) T
on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1
;
此SO答案是不使用CONNECT BY进行标记化的灵感.
The inspiration to tokenize without using CONNECT BY came from this SO answer.
TOKENS表的内容如下所示:
The contents of the TOKENS table will look something like this:
SQL> select * from tokens ;
TOKEN ID
ASCORBIC ACID T1_1
SODIUM HYDROGEN CARBONATE T1_2
CAFFEINE T1_3
PSEUDOEPHEDRINE HYDROCHLORIDE T1_4
PARACETAMOL T1_100
sodium hydroxide T1_110
POTASSIUM HYDROGEN CARBONATE T2_4
SODIUM HYDROGEN CARBONATE T2_5
PARACETAMOL PH. EUR. T2_6
CODEINE PHOSPHATE T2_7
DEXCHLORPHENIRAMINE MALEATE T2_8
DEXCHLORPHENIRAMINE MALEATE T2_10
PARACETAMOL T2_200
...
{2} GROUP BY,LISTAGG,自我加入
select
S1.id id1
, S2.id id2
, S1.tokengroup_T1
, S2.tokengroup_T2
from
(
select substr( id, 4, length( id ) - 3 ) id
, listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1
from tokens
group by id
having substr( id, 1, 3 ) = 'T1_'
) S1
join
(
select substr( id, 4, length( id ) - 3 ) id
, listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2
from tokens
group by id
having substr( id, 1, 3 ) = 'T2_'
) S2
on S1.tokengroup_T1 = S2.tokengroup_T2
;
-- result
ID1 ID2 TOKENGROUP_T1 TOKENGROUP_T2
4 10 DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE
110 210 potassium carbonate + sodium hydroxide potassium carbonate + sodium hydroxide
1 4 ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE
3 6 CAFFEINE + PARACETAMOL PH. EUR. CAFFEINE + PARACETAMOL PH. EUR.
以这种方式进行处理时,您可以使物质按字母(字母)顺序排列,还可以在此处选择所需的定界符"(我们用'+'表示).
When doing things this way, you can get the substances into (alphabetical) order, and you can also pick a "delimiter" that you like (we have used '+') here.
替代
如果这对您没有用,或者您认为这太复杂了,则可以尝试使用TRANSLATE().在这种情况下,建议您从数据集中删除所有空格/空格(在查询中-不更改原始数据!),如下所示:
If all that is no use to you, or you think this is too complicated, then you could try using TRANSLATE(). In this case, I'd recommend stripping all spaces/blanks from your dataset (in a query - not altering the original data!) like so:
查询
select
id1, id2
, name1, name2
from (
select
id_t1 id1
, id_t2 id2
, T1.name1 name1
, T2.name2 name2
from T1
join T2
on translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' )
= translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )
) ;
结果
ID1 ID2 NAME1 NAME2
2 5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
3 6 CAFFEINE, PARACETAMOL PH. EUR. PARACETAMOL PH. EUR.,CAFFEINE
100 10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE
110 210 sodium hydroxide, potassium carbonate sodium hydroxide, potassium carbonate
注意: 我已将以下行添加到您的示例数据中:
NOTE: I've added the following rows to your sample data:
-- T1
110, 'sodium hydroxide, potassium carbonate'
-- T2
210, 'sodium hydroxide, potassium carbonate'
211, 'potassium hydroxide, sodium carbonate'
我发现很容易使用TRANSLATE()来给您假阳性",即ID为110、210和211的物质看起来会匹配". (换句话说:我认为这不是这项工作的正确工具.)
I found that it is easy to use TRANSLATE() in a way that gives you "false positives" ie the substances with ids 110, 210, and 211 will appear to "match". (In other words: I don't think this is the right tool for this job.)
(点击链接可查看示例表和查询).
(follow the link to see the sample tables and queries).
这篇关于比较两个不同表的两列中逗号分隔的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!