Script to convert unicode characters in <U9999> format to their ASCII equivalents


Problem description


I'm making some changes to the Linux locale files in /usr/share/i18n/locales (such as pt_BR) to change the default format of dates, times, numbers, etc. But since Unicode characters are written as strings in the <U9999> format, the text is very hard to read.

Here is a snippet of it:

LC_TIME
abday   "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
    "<U0054><U0065><U0072>";"<U0051><U0075><U0061>";/
    "<U0051><U0075><U0069>";"<U0053><U0065><U0078>";/
    "<U0053><U00E1><U0062>"

So, how can I write a simple script (bash, Python, Perl, whatever) that converts this text, replacing the <Uxxxx> codes with their ASCII equivalents? (Yes, they are all ASCII chars below 255, most even below 127.)

If several answers are received, I'll accept the most elegant and/or the most thoroughly explained one (for example, one that explains the options and flags used in the commands).

As an example, the above text would be converted to:

LC_TIME
abday   "Dom";"Seg";/
    "Ter";"Qua";/
    "Qui";"Sex";/
    "Sáb"

Bonus points for another script that does the opposite: convert all chars of a given string to the <Uxxxx> format.

Thanks!

Solution

Using Fields

#!/bin/bash

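# strtonum() is a gawk extension, so awk must resolve to gawk here.
awk -F'<U0+|>' '{
    # a field made only of hex digits is what remains of a <Uxxxx> code
    for(i=1;i<=NF;i++)
        if($i ~ "^[0-9A-F]+$")
            $i=sprintf("%c", strtonum("0x"$i))
}1' OFS="" /path/to/infile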

Explanation

  1. -F'<U0+|>': This is the magic that makes this script so short. We tell awk that the field separator is either <U0+ or a simple >. The benefit is that awk strips these characters for us, so we don't have to do it manually with gsub() when it comes time to do the strtonum() conversion.
  2. for(i=1;i<=NF;i++): iterate over each field.
  3. if($i ~ "^[0-9A-F]+$"): check whether the current field consists only of hex digits. Remember that, because of #1 above, something like <U006F> will be seen as 6F at this point.
  4. $i=sprintf("%c", strtonum("0x"$i)): replace the hex digits with the corresponding character. We must prefix the field $i with "0x" so that strtonum() treats it as a hex value.
  5. }1: the trailing 1 is an always-true pattern with no action, so awk applies its default action and prints every line.
  6. OFS="": set the Output Field Separator to the empty string. Assigning to a field makes awk rebuild the line by joining the fields with OFS; if we don't do this, we get a space in the output everywhere there was a <U0+ or >.
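To make the field-splitting idea in steps 1 to 6 easier to see, here is a minimal sketch of the same logic in Python, which the question also allows. It is not part of the original answer; the names SEPARATOR, HEX_ONLY and decode_line are hypothetical and chosen only for this illustration.

```python
import re

# Same separator as awk -F'<U0+|>': either "<U" plus leading zeros, or ">".
SEPARATOR = re.compile(r"<U0+|>")
# Same test as the awk condition $i ~ "^[0-9A-F]+$".
HEX_ONLY = re.compile(r"[0-9A-F]+")


def decode_line(line):
    """Split on the separators, convert hex-only fields, join with no separator."""
    fields = SEPARATOR.split(line)
    is_code = [HEX_ONLY.fullmatch(field) is not None for field in fields]
    # awk only rebuilds the line when a field was assigned; mirror that here
    # so lines without any code are returned untouched.
    if not any(is_code):
        return line
    return "".join(
        chr(int(field, 16)) if hit else field
        for field, hit in zip(fields, is_code)
    )


# What the splitter leaves of a single code: the hex digits between two
# empty fields.
print(SEPARATOR.split("<U006F>"))  # ['', '6F', '']
```

Joining with the empty string plays the role of OFS="". Like the awk version, this sketch treats any field made only of hex digits as a code, so it assumes the input has no such text between stray > characters.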


Using match() [requires gawk]

#!/bin/bash

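gawk '{
    # replace one <Uxxxx> code per pass until none are left on the line
    while(match($0, /<U[0-9A-F]+>/)){
        pat = substr($0,RSTART,RLENGTH)           # the matched code, e.g. <U006F>
        gsub(/U0+|[<>]/,"",pat)                   # strip <, > and U with its leading zeros
        asc = sprintf("%c", strtonum("0x"pat))    # hex digits -> character
        $0 = substr($0, 1, RSTART-1) asc substr($0, RSTART+RLENGTH)
    }
}1' /path/to/infile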
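Neither awk script covers the bonus request from the question, the reverse conversion. As a hedged sketch only, and not part of the original answer, the round trip could look like this in Python. The names CODE, decode and encode are hypothetical; decode uses a plain regex substitution, the closest Python counterpart to the match() loop above.

```python
import re

# One <Uxxxx> code; the hex digits are captured in group 1.
CODE = re.compile(r"<U([0-9A-Fa-f]+)>")


def decode(text):
    """Replace every <Uxxxx> code with the character it names."""
    return CODE.sub(lambda m: chr(int(m.group(1), 16)), text)


def encode(text):
    """Write every character of text as <Uxxxx>, zero-padded to 4 hex digits."""
    return "".join("<U{:04X}>".format(ord(ch)) for ch in text)


# "S\u00e1b" is the string S, a-acute, b from the question's sample.
print(encode("S\u00e1b"))  # <U0053><U00E1><U0062>
```

Note that encode converts every character it is given, including quotes, semicolons and spaces. For a locale file you would presumably pass it only the text between the double quotes.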
