Using sql/plsql, how do you find out which character set a text uses?


Problem description

I have an Oracle db which stores the content of documents originating from all over the world, with different languages. The documents are stored in a table with a BLOB column which stores the documents' content.

I want to find out what the char set is for every doc, with an Oracle procedure. I don't want to use the utility CSSCAN since it seems you have to use it in a separate session, outside of your procedure.

Thanks for your help!

Solution

Oracle Globalization Development Kit can detect character sets.

The GDK is included with Oracle, but it is not installed in the database by default. To load the .jar files into the database, find the jlib directory in the Oracle home and run this operating system command:

loadjava -u USER_NAME@SID orai18n.jar orai18n-collation.jar orai18n-lcsd.jar orai18n-mapping.jar orai18n-net.jar orai18n-servlet.jar orai18n-tools.jar orai18n-translation.jar orai18n-utility.jar

Some extra Java privileges are needed, even if your user has DBA. Run this command and then re-connect:

exec dbms_java.grant_permission( 'YOUR_USER_NAME', 'SYS:java.lang.RuntimePermission', 'getClassLoader', '' );

Create a Java class to do the detection. Below is a very simple example that returns the detector's best guess for the contents of a BLOB:

create or replace and compile java source named "Character_Set_Detector"
as
import oracle.i18n.lcsd.*;
import java.sql.*;
import java.io.IOException;
public class Character_Set_Detector
{
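    // Returns the Oracle character set name of the detector's best guess for the BLOB.
    public static String detect(Blob some_blob) throws SQLException, IOException
    {
        LCSDetector detector = new LCSDetector();
        // Feed the BLOB's bytes to the detector as a stream.
        detector.detect(some_blob.getBinaryStream());
        // Take the detector's result and return its Oracle character set name.
        LCSDResultSet detector_results = detector.getResult();
        return detector_results.getORACharacterSet();
    }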
}
/

Wrap the Java class in a PL/SQL function:

--Wrap the Java class in a PL/SQL function:
create or replace function detect_character_set(some_blob blob)
return varchar2
as language java
name 'Character_Set_Detector.detect(java.sql.Blob) return java.lang.String';
/

I simulated different character sets by translating a string into different languages, saving the text as different encodings with a text editor, opening the file with hex editor, and converting the hex into a BLOB:

--UTF8
--The quick brown fox jumps over the lazy dog
select 1 id, detect_character_set(hextoraw('54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f67')) character_set from dual union all
--Western European (ISO-8859-1)
--El zorro marrón rápido salta sobre el perro perezoso
select 2 id, detect_character_set(hextoraw('456c207a6f72726f206d617272f36e2072e17069646f2073616c746120736f62726520656c20706572726f20706572657a6f736f')) from dual union all
--Chinese Simplified (GBK)
--敏捷的棕色狐狸跳过懒狗
select 3 id, detect_character_set(hextoraw('c3f4bdddb5c4d7d8c9abbafcc0eaccf8b9fdc0c1b9b7')) from dual union all
--Western European (Windows-1252)
--Der schnelle braune Fuchs springt über den faulen Hund
select 4 id, detect_character_set(hextoraw('446572207363686e656c6c6520627261756e6520467563687320737072696e677420fc6265722064656e206661756c656e2048756e64')) from dual union all
--Cyrillic (KOI8-R)
--Быстрая коричневая лиса прыгает через ленивую собаку
select 5 id, detect_character_set(hextoraw('e2d9d3d4d2c1d120cbcfd2c9decec5d7c1d120ccc9d3c120d0d2d9c7c1c5d420dec5d2c5da20ccc5cec9d7d5c020d3cfc2c1cbd5')) from dual;

ID  CHARACTER_SET
--  -------------
1   US7ASCII
2   WE8ISO8859P1
3   ZHS16CGB231280
4   WE8ISO8859P1
5   CL8KOI8R
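
The hex literals in the query above can also be produced without a text editor and a hex editor. The following is a minimal sketch in plain Java, run outside the database; the class name `HexLiteral` and its `toHex` method are hypothetical helpers introduced here for illustration and are not part of the GDK. It assumes the JVM provides the `KOI8-R` charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the GDK): encodes text in a chosen
// charset and renders the bytes as a hex string suitable for HEXTORAW.
final class HexLiteral {

    // Encode the text with the given charset and return lowercase hex.
    static String toHex(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII, so the result
        // does not depend on the encoding the compiler assumes.
        String spanish = "El zorro marr\u00f3n r\u00e1pido salta sobre el perro perezoso";
        System.out.println("hextoraw('" + toHex(spanish, StandardCharsets.ISO_8859_1) + "')");

        // First word of the Cyrillic sample, encoded as KOI8-R.
        String russian = "\u0411\u044b\u0441\u0442\u0440\u0430\u044f";
        System.out.println("hextoraw('" + toHex(russian, Charset.forName("KOI8-R")) + "')");
    }
}
```

The first printed literal should match the one used for row 2 in the query, and the second should match the opening bytes of the literal used for row 5.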

That trivial example works well, but I don't know how well it will work with real-world files. There are a lot of features in the GDK; the above code is only a simple starting point. With only minor changes, the code can also detect languages, as demonstrated in my answer here.
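
One detail in the results is worth a second look: the rows labelled UTF8 and Windows-1252 come back as US7ASCII and WE8ISO8859P1. That is probably not a detector error. For those particular sample strings the encoded bytes are identical in both character sets, so nothing in the data could tell them apart. The sketch below checks this in plain Java, outside the database; the class name `SameBytes` is a hypothetical helper for illustration, and it assumes the JVM provides the `windows-1252` charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: checks whether a text encodes to the same bytes
// in two charsets. If it does, no detector can distinguish the two.
final class SameBytes {

    static boolean sameBytes(String text, Charset first, Charset second) {
        return Arrays.equals(text.getBytes(first), text.getBytes(second));
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");

        // Row 1: pure ASCII text has the same bytes in UTF-8 and US-ASCII.
        String english = "The quick brown fox jumps over the lazy dog";
        System.out.println(sameBytes(english, StandardCharsets.UTF_8, StandardCharsets.US_ASCII));

        // Row 4: the only non-ASCII character is u-umlaut (U+00FC), which is
        // byte 0xFC in both windows-1252 and ISO-8859-1.
        String german = "Der schnelle braune Fuchs springt \u00fcber den faulen Hund";
        System.out.println(sameBytes(german, windows1252, StandardCharsets.ISO_8859_1));

        // The euro sign exists in windows-1252 but not in ISO-8859-1,
        // so text containing it encodes to different bytes.
        System.out.println(sameBytes("\u20ac", windows1252, StandardCharsets.ISO_8859_1));
    }
}
```

The first two checks print `true` and the third prints `false`. Test data meant to exercise Windows-1252 or UTF-8 specifically therefore needs characters that only those encodings represent, or that they represent differently.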
