Convert UTF-8 character sequence to real UTF-8 bytes


Problem description

I have a plain-text file (.yml) that contains UTF-8 character sequences like this:

foo: "Dette er en \xC3\xB8 "

The problem lies in \xC3\xB8: these are not "real" UTF-8 bytes, because they are stored in the text file as 8 literal characters: \ x C 3 \ x B 8
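
To make the difference concrete, here is a small shell sketch. The variable name literal is only for illustration and does not come from the original post. It counts the bytes in the escape text as stored in the file, then the bytes in the real UTF-8 encoding of the character ø:

```shell
# The escape text as it is stored in the file: eight ASCII characters.
literal='\xC3\xB8'
printf '%s' "$literal" | wc -c

# The real UTF-8 encoding of the character: two bytes
# (octal 303 270 is hex C3 B8).
printf '\303\270' | wc -c
```

The first count is 8 and the second is 2, which is the gap the question asks how to close.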

Is there a way to convert these into the real 2-byte UTF-8 sequence?

Any OS, language, or shell tool may be used :-)

/ Carsten

Recommended answer

Use this Perl script to convert your file:

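#!/usr/bin/perl
use strict;
use warnings;

# Replace each literal \xHH escape (upper- or lowercase hex) with the single byte it names.
while (<STDIN>) {
  s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/eg;
  print;
}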

Assuming you saved the script in a file named bogusutf, run the conversion with this command:


$ perl bogusutf < inputfile > outputfile
