Script to convert ASCII chars to "<Uxxx>" unicode notation

Question

I'm doing some changes in Linux locale files /usr/share/i18n/locales (like pt_BR), and it's required that format strings (like %d-%m-%Y %H:%M) must be specified in Unicode, where each (in this case, ASCII) character is represented as <U00xx>.

So text like this:

LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt   "%d-%m-%Y"
t_fmt   "%T"

must become:

LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt   "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt   "<U0025><U0054>"

So I need a command-line script (bash, Python, Perl, or anything else) that takes input like %d-%m-%Y and converts it to <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>.

All characters in the input string would be ASCII chars (from 0x20 to 0x7F), so this is actually a fancier "char-to-hex-string" conversion.

Could anyone please help me? My skills in bash scripting are very limited, and even worse in Python.

Bonus for elegant, explained solutions.

Thanks!

(by the way, this would be the "reverse" script for my previous question)

Recommended answer

Every char with file input

If you want to convert every character of a file to its Unicode representation, this simple one-liner will do:

while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile
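
One thing to watch with the one-liner: read consumes the newline that ends each line and leaves c empty, so line breaks are not passed through as line breaks. A minimal sketch of a variant that keeps them, assuming that is what you want (convert_lines is a hypothetical name, not part of the original answer):

```shell
#!/bin/bash
# Hypothetical helper: same loop as the one-liner above, but an empty $c
# (the newline consumed by read) is written back out as a line break.
convert_lines() {
    local c
    while IFS= read -r -n1 c; do
        if [[ -z $c ]]; then
            printf '\n'                 # restore the line break
        else
            printf '<U%04X>' "'$c"      # numeric code of the character, as 4 hex digits
        fi
    done
}

# Example: two short lines on STDIN
printf '%%T\nab\n' | convert_lines
```

It reads STDIN, so it can be fed a file with convert_lines < ./infile just like the original loop.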


Every char on STDIN

If you want to make a unix-like tool which converts input on STDIN to unicode-like output, then use this:

uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }

Proof of concept

$ echo "abc" | uni
<U0061><U0062><U0063>
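
Since the question asks for something that takes a format string such as %d-%m-%Y directly, here is a minimal sketch of the same loop taking its input as an argument instead of STDIN. The name to_uni is an assumption for illustration and does not appear in the original answer:

```shell
#!/bin/bash
# Hypothetical helper: convert the first argument, character by character,
# to <Uxxxx> notation and finish with a newline.
to_uni() {
    local s=$1 i
    for ((i = 0; i < ${#s}; i++)); do
        # "'X" makes printf use the numeric code of X
        printf '<U%04X>' "'${s:i:1}"
    done
    printf '\n'
}

to_uni '%d-%m-%Y'
# prints <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>
```

Single-quoting the argument keeps the shell from touching the % signs and spaces before the function sees them.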


Only characters between double quotes

#!/bin/bash

flag=0
while IFS= read -r -n1 c; do
    if [[ "$c" == '"' ]]; then
        ((flag^=1))
        printf "%c" "$c"
    elif [[ "$c" == $'\0' ]]; then
        echo
    elif ((flag)); then
        printf "<U%04X>" "'$c"
    else
        printf "%c" "$c"
    fi
done < /path/to/infile

Proof of concept

$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt   "%d-%m-%Y"
t_fmt   "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/

$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt   "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt   "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/

Explanation

Pretty simple, really:


  1. while IFS= read -r -n1 c;: Iterate over the input one character at a time (via -n1) and store the character in the variable c. IFS= keeps the read builtin from stripping whitespace, and -r keeps it from interpreting backslash escapes.
  2. if [[ "$c" == '"' ]];: If the current character is a double quote, toggle the flag and print the quote unchanged.
  3. ((flag^=1)): Invert the value of flag, from 0->1 or 1->0.
  4. elif [[ "$c" == $'\0' ]];: If c is empty, echo a newline. In bash, $'\0' expands to an empty string, and read leaves c empty when it consumes the newline that ends a line, so this branch restores the line breaks.
  5. elif ((flag)): If flag is 1 (we are inside double quotes), perform the Unicode transliteration.
  6. printf "<U%04X>" "'$c": The magic that does the transliteration. The single quote before $c is mandatory: it tells printf to use the numeric code of the character that follows, which %04X then prints as four uppercase hex digits.
  7. else printf "%c" "$c": Print the character unchanged, with no transliteration.
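
The two less obvious pieces from the steps above, the leading-quote argument to printf and the XOR toggle, can be tried in isolation. This is only an illustrative sketch; it uses flag=$((flag ^ 1)) rather than ((flag^=1)) so that it also runs in shells without the (( )) construct:

```shell
#!/bin/sh
# A leading single quote in the argument makes printf use the numeric
# code of the character that follows it.
printf '<U%04X>\n' "'%"    # prints <U0025>
printf '<U%04X>\n' "'A"    # prints <U0041>

# XOR with 1 flips the flag between 0 and 1, once per double quote seen.
flag=0
flag=$((flag ^ 1)); echo "$flag"    # prints 1 (inside quotes)
flag=$((flag ^ 1)); echo "$flag"    # prints 0 (outside quotes again)
```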
