Script to convert Unicode characters in <U9999> format to their ASCII equivalents

Views: 196  Posted: 2016/8/3 12:04:26  Tags: python xml bash unicode


Question

I'm making some changes to Linux locale files in /usr/share/i18n/locales (such as pt_BR) to change the default formats of dates, times, numbers, etc. But since Unicode characters are written as strings in the <U9999> format, the text is very hard to read.

Here is a snippet of it:

LC_TIME
abday   "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
    "<U0054><U0065><U0072>";"<U0051><U0075><U0061>";/
    "<U0051><U0075><U0069>";"<U0053><U0065><U0078>";/
    "<U0053><U00E1><U0062>"

So, how can I write a simple script (bash, Python, Perl, whatever) that converts this text by replacing the <Uxxxx> codes with their ASCII equivalents? (Yes, they are all ASCII chars below 255, most even below 127.)

If I receive several answers, I'll accept the most elegant one and/or the one with the most detailed explanation (such as the options and flags used in commands).

As an example, the above text would be converted to:

LC_TIME
abday   "Dom";"Seg";/
    "Ter";"Qua";/
    "Qui";"Sex";/
    "Sáb"

Bonus points for another script that does the opposite: convert all chars of a given string to the <Uxxxx> format.

Thanks!

Solution

Using Fields

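#!/bin/bash

# strtonum() is a gawk extension, so awk must be gawk here.
awk -F'<U0+|>' '{
    # after splitting, a field made only of hex digits came from a <Uxxxx> token
    for(i=1;i<=NF;i++)
        if($i ~ "^[0-9A-F]+$")
            $i=sprintf("%c", strtonum("0x"$i))
}1' OFS="" /path/to/infile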

Explanation

-F'<U0+|>': This is the magic that makes the script so short. We tell awk that the field separator is either <U0+ or a single >. The benefit is that awk strips these characters for us, so we don't have to remove them manually with gsub() before the strtonum() conversion.
for(i=1;i<=NF;i++): iterate over each field.
if($i ~ "^[0-9A-F]+$"): check whether the current field consists only of hex digits. Remember that because of the field separator set with -F, something like <U006F> is seen as 6F at this point.
$i=sprintf("%c", strtonum("0x"$i)): replace the hex digits with the corresponding character. We must prefix the field $i with "0x" so that awk knows it's a hex value.
}1: shortcut that makes awk print every line.
OFS="": set the Output Field Separator to the null string. If we don't do this, we get a space in the output everywhere there was a <U0+ or >.
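For readers who prefer Python (one of the languages the question mentions), the same conversion can be sketched with a regular expression instead of field splitting. This is only an illustration, not part of the original answer; the names decode_locale_text and _TOKEN are made up for this example. Because it matches whole <Uxxxx> tokens, it leaves alone any text that merely happens to consist of hex digits.

```python
import re

# Matches one <Uxxxx> token; the capture group holds the hex digits.
# (_TOKEN and decode_locale_text are hypothetical names for this sketch.)
_TOKEN = re.compile(r"<U([0-9A-Fa-f]+)>")


def decode_locale_text(text: str) -> str:
    """Replace every <Uxxxx> token in text with the character it names."""
    # int(..., 16) parses the hex digits; chr() maps the code point to a character.
    return _TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_locale_text('"<U0044><U006F><U006D>"'))  # prints "Dom" (with the quotes)
```

To convert a whole file, the function could be applied to each line read from it and the result written to standard output.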

Using match() [requires gawk]

#!/bin/bash

gawk '{
    # replace one <Uxxxx> token per pass until none are left
    while(match($0, /<U[0-9A-F]+>/)){
        pat = substr($0,RSTART,RLENGTH)
        gsub(/U0+|[<>]/,"",pat)              # keep only the significant hex digits
        asc = sprintf("%c", strtonum("0x"pat))
        $0 = substr($0, 1, RSTART-1) asc substr($0, RSTART+RLENGTH)
    }
}1' /path/to/infile
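The question also asks for the opposite direction, which the answer above does not cover. A minimal Python sketch of one way to do it follows; it is an illustration under assumptions rather than a tested tool, and the name encode_locale_text is hypothetical. It encodes every character it is given, including quotes and spaces, so it should be applied only to the string contents.

```python
def encode_locale_text(text: str) -> str:
    """Write every character of text as a <Uxxxx> token."""
    # ord() gives the code point; 04X formats it as at least four
    # upper-case hex digits, matching the style used in the locale files.
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


print(encode_locale_text("Dom"))  # prints <U0044><U006F><U006D>
```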
