Script to convert unicode characters in <U9999> format to their ASCII equivalents
Problem description
I'm making some changes to the Linux locale files in /usr/share/i18n/locales (such as pt_BR) to change the default formats of dates, times, numbers, etc. But since Unicode characters are represented as strings in the <U9999> format, the text is very hard to read.
Here is a snippet of it:
LC_TIME
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
"<U0054><U0065><U0072>";"<U0051><U0075><U0061>";/
"<U0051><U0075><U0069>";"<U0053><U0065><U0078>";/
"<U0053><U00E1><U0062>"
So, how can I write a simple script (bash, Python, Perl, whatever) that converts this text, replacing the <Uxxxx> codes with their ASCII equivalents? (Yes, all of them are characters below 255, and most are even below 127.)
If several answers are received, I'll accept the most elegant one and/or the one with the most detailed explanation (such as the options and flags used in the commands).
As an example, the above text would be converted to:
LC_TIME
abday "Dom";"Seg";/
"Ter";"Qua";/
"Qui";"Sex";/
"Sáb"
Bonus points for another script that does the opposite: converts all characters of a given string to the <Uxxxx> format.
Thanks!
Using Fields
#!/bin/bash
awk -F'<U0+|>' '{
    for(i=1;i<=NF;i++)
        if($i ~ "^[0-9A-F]+$")
            $i=sprintf("%c", strtonum("0x"$i))
}1' OFS="" /path/to/infile
Explanation
1. -F'<U0+|>': This is the magic that makes this script so short. We tell awk that the field separator is either <U0+ (that is, <U followed by one or more zeros) or a simple >. The benefit is that awk strips these characters for us automatically, so we don't have to do it manually with gsub() when it comes time to do the strtonum() conversion.
2. for(i=1;i<=NF;i++): iterate over each field.
3. if($i ~ "^[0-9A-F]+$"): check whether the current field consists only of hex digits. Remember that, because of #1 above, something like <U006F> will be seen as 6F at this point.
4. $i=sprintf("%c", strtonum("0x"$i)): replace the hex digits with the corresponding character. We must prefix the field $i with "0x" so that strtonum() knows it's a hex value. Note that strtonum() is a gawk extension, so the awk on your system must be gawk for this to work.
5. }1: shortcut for a mandatory print, i.e. always print each line.
6. OFS="": set the Output Field Separator to the null string. If we don't do this, we will get spaces in the output everywhere there was a <U0+ or >.
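Since the question also allows Python, the same field-splitting idea can be sketched there for comparison. This is only an illustrative translation of the steps above, not part of the original answer; the names SEPARATOR, HEX_ONLY and decode_fields are made up for this example. It shares one caveat with the awk version: any field that happens to consist only of hex digits is converted, even if it never came from a <Uxxxx> code.

```python
import re

# Same separator as awk's -F'<U0+|>': "<U" plus one or more zeros, or ">".
SEPARATOR = re.compile(r"<U0+|>")
# Same test as $i ~ "^[0-9A-F]+$" (fullmatch anchors both ends).
HEX_ONLY = re.compile(r"[0-9A-F]+")


def decode_fields(line: str) -> str:
    """Decode <Uxxxx> codes in one line using the field-splitting idea."""
    fields = SEPARATOR.split(line)
    for i, field in enumerate(fields):
        # Only fields made up entirely of hex digits are treated as codes.
        if HEX_ONLY.fullmatch(field):
            fields[i] = chr(int(field, 16))
    # Joining with "" plays the role of OFS="".
    return "".join(fields)


if __name__ == "__main__":
    sample = 'abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/'
    print(decode_fields(sample))  # abday "Dom";"Seg";/
```

Splitting leaves empty fields between adjacent codes; they fail the hex test and vanish when the fields are joined, which is the same effect OFS="" has in awk.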
Using match() [requires gawk]
#!/bin/bash
gawk '{
    while(match($0, /<U[0-9A-F]+>/)){
        pat = substr($0,RSTART,RLENGTH)
        gsub(/U0+|[<>]/,"",pat)
        asc = sprintf("%c", strtonum("0x"pat))
        $0 = substr($0, 1, RSTART-1) asc substr($0, RSTART+RLENGTH)
    }
}1' /path/to/infile
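The answer above does not cover the bonus part of the question (converting a string back to the <Uxxxx> format). A minimal sketch of both directions in Python, which the question allows, might look like the following. The function names decode and encode are invented for this example, and the four-digit zero padding is an assumption based on the snippet in the question. As the question asks, encode converts every character, so quotes, semicolons and slashes are encoded too.

```python
import re

# Same pattern as the match() call in the gawk script, with a capture group.
CODE = re.compile(r"<U([0-9A-F]+)>")


def decode(text: str) -> str:
    """Replace every <Uxxxx> code with the character it stands for."""
    return CODE.sub(lambda m: chr(int(m.group(1), 16)), text)


def encode(text: str) -> str:
    """Write every character of text as <Uxxxx> (uppercase hex, 4+ digits)."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


if __name__ == "__main__":
    print(encode("Dom"))                      # <U0044><U006F><U006D>
    print(decode("<U0051><U0075><U0061>"))    # Qua
```

Because decode only touches text that matches the full <U...> pattern, it avoids the hex-only-field caveat of the field-splitting approach.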