计算大型Oracle表中的CLOB重复项 [英] Count CLOB Duplicates in a large Oracle Table

查看:51
本文介绍了计算大型Oracle表中的CLOB重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Oracle数据库表LOG_MESSAGES,其中有一个名为MESSAGE的CLOB列.一些行包含相同的MESSAGE.

I have an Oracle database table LOG_MESSAGES with a CLOB column called MESSAGE. Some of the rows contain the same MESSAGE.

对于每个至少具有重复项的MESSAGE,我想知道重复项的数量.这些CLOB中有许多很大(> 100 kB),因此转换为VARCHAR2毫无疑问.由于GROUP BY之类的许多传统方法不适用于CLOB,因此有人可以启发我吗?

For each MESSAGE which has at least a duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2 is out of question. Since many traditional methods such as GROUP BY do not work with CLOB, could someone please enlighten me?

有关信息,该表非常大(大约1 TB).因此,非常感谢您使用一种优化的方法.

For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.

预先感谢您的帮助.

推荐答案

我认为这个问题被问了很多,但不幸的是,这似乎并不是一种完美的方法.虽然有些方法可以很好地工作.

I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.

搜索群分类依据"或群不同",您将在此网站上看到一些匹配项.

Search for "clob group by" or "clob distinct" and you will see several hits just on this website.

一种方法是编写一个PL/SQL脚本,该脚本在表中的所有Clob之间执行DBMS_LOB.COMPARE,但是效率可能约为O(n ^ 2),这对于您的速度而言确实很慢目的.

One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE between all clobs in the table but the efficiency would probably be in the order of O(n^2) which would make it really slow for your purpose.

另一种广为接受的方法是使用dbms_crypto(我认为允许对Clob进行散列)获取Clob的散列值,然后对散列值进行分组.可能会发生哈希冲突,但是概率很小.我读到2 ^ 80左右的地方(尽管数字可能有误).这不会像第一种方法那样慢,但是计算散列也将花费不可忽略的时间.

Another approach that is well accepted is to take a hash value of the clob using dbms_crypto (i think that allows hashing on clobs) and then group by on the hash values. There is a possibility of hash collision, but the probability is minute. I read somewhere around 2^80 (number might be wrong though). This won't be as slow as the first approach but calculating a hash would also take non-negligible time.

我建议先尝试使用散列方法,如果这看起来太慢,请寻找替代方法.

I would suggest try the hash approach first and if that seems too slow, look for alternatives.

这篇关于计算大型Oracle表中的CLOB重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆