How to handle special characters (ഀ) in a file
Question
I have a file that looks like this when I open it in Notepad++:
A|B|C|D|Eഀ
31|HB|39|Ph|49ഀ
32|FB|38|Ph|59ഀ
When I open it from WinSCP, it looks like this:
ÿþA|B|C|D|E
31|HB|39|Ph|49
32|FB|38|Ph|59
I want to read this file with the BPEL File Adapter, but I am unable to because the {eol} is not recognized. I have also tried using ഀ as the end-of-line marker, but with no luck.
Screenshot: <http://i.stack.imgur.com/Rc8B8.png>
Thanks in advance,
Abhishek
Recommended answer
When you ran od -c on this file (see the comments above), you found:
0000000 377 376 A \0 | \0 B \0 | \0 C \0 | \0 D \0
0000020 | \0 E \0 \r \n \0 \r \n \0 3 \0 1 \0 | \0
0000040 H \0 B \0 | \0 3 \0 9 \0 | \0 P \0 h \0
0000060 | \0 4 \0 9 \0 \r \n \0 \r \n \0 3 \0 2 \0
0000100 | \0 F \0 B \0 | \0 3 \0 8 \0 | \0 P \0
0000120 h \0 | \0 5 \0 9 \0 \r \n \0 \r \n \0
0000136
Let's start from the beginning.
Did you notice the first two bytes? Octal 377 and 376, that is, 0xFF 0xFE in hex. This is the so-called Byte Order Mark (BOM). It tells the application that has to read the file which encoding and endianness to expect.
Now, if the BOM is 0xFF 0xFE, it means the file contains Unicode characters encoded as UTF-16. To be precise, it is a little-endian UTF-16 file (UTF-16LE).
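To make the BOM check concrete, here is a minimal Python sketch. The function name detect_bom and the lookup table are my own illustration, not part of any tool mentioned in the question; the BOM constants come from Python's standard codecs module. UTF-32 is tested first because its little-endian BOM also starts with 0xFF 0xFE.

```python
import codecs
from typing import Optional

# (BOM bytes, encoding name) pairs; the longer UTF-32 BOMs come first
# because the UTF-32LE BOM begins with the same two bytes as UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(raw: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# First bytes of the file from the od -c dump: 377 376 A \0 | \0
print(detect_bom(b"\xff\xfeA\x00|\x00"))
```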
As your file is encoded as UTF-16LE, every character in it takes two bytes:
- the first character (Latin capital letter A) is represented by "A \0"
- the second character is "| \0"
- the third character is "B \0"
- and so on...
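The two-bytes-per-character layout can be reproduced with a short Python sketch. The helper name code_units is hypothetical; it simply splits the UTF-16LE encoding of a string into 2-byte chunks, which is enough for the plain ASCII characters in this file.

```python
def code_units(text: str) -> list:
    """Encode text as UTF-16LE (no BOM) and split it into 2-byte chunks."""
    encoded = text.encode("utf-16-le")
    return [encoded[i:i + 2] for i in range(0, len(encoded), 2)]


# The first three characters of the header line "A|B|C|D|E"
for unit in code_units("A|B"):
    print(unit)
```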
So far, so good. The problem is "\r \n". This sequence would be the usual CR LF if the file were encoded as UTF-8, but the BOM says the file is encoded as UTF-16LE, so:
- carriage return should be represented by 0x0D 0x00 (\r \0 in od -c)
- line feed should be represented by 0x0A 0x00 (\n \0 in od -c)
The fact that you have "\r \n" instead of "\r \0 \n \0" confuses any application that tries to interpret these bytes as UTF-16LE. The sequence "\r \n" is not a valid line ending in UTF-16LE, and the application uses two "boxes" to represent the bytes it cannot interpret.
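To see what a UTF-16LE decoder actually makes of those bytes, here is a small Python sketch that decodes the fragment around the first line ending, copied from the od -c dump (E \0 \r \n \0 \r \n \0 3 \0). The variable names are my own. Note that the middle code unit comes out as U+0D00, which appears to be the ഀ that Notepad++ shows at the end of each line.

```python
# Bytes around the first line ending, as shown by od -c
damaged = b"E\x00\r\n\x00\r\n\x003\x00"

# Decode strictly as UTF-16LE: every 2 bytes become one code unit
decoded = damaged.decode("utf-16-le")
codepoints = [f"U+{ord(ch):04X}" for ch in decoded]

print(codepoints)
# → ['U+0045', 'U+0A0D', 'U+0D00', 'U+000A', 'U+0033']
```

Only the last of the three middle code units (U+000A) is a real line feed; the other two are what the reader application renders as odd characters.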
To keep it short: your file is badly encoded (half UTF-16LE and half UTF-8). My guess is that someone used Notepad or a similar tool to alter its content.
You can try using iconv and/or sed to fix it.
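If iconv and sed are not at hand, the same repair can be sketched in Python. This is only a sketch under one assumption: that every damaged line ending is exactly the 6-byte pattern seen in the dump (\r \n \0 \r \n \0) and that this pattern never occurs inside real data. The function name repair_utf16le and the constants are hypothetical.

```python
import codecs

# Damaged line ending as seen in the od -c dump: \r \n \0 \r \n \0
BROKEN_EOL = b"\r\n\x00\r\n\x00"
# Well-formed CR LF in UTF-16LE: \r \0 \n \0
GOOD_EOL = b"\r\x00\n\x00"


def repair_utf16le(raw: bytes) -> str:
    """Replace the damaged line endings, drop the BOM, and decode the text."""
    fixed = raw.replace(BROKEN_EOL, GOOD_EOL)
    if fixed.startswith(codecs.BOM_UTF16_LE):
        fixed = fixed[len(codecs.BOM_UTF16_LE):]
    return fixed.decode("utf-16-le")


# A shortened stand-in for the real file: BOM, "A|B", broken EOL, "31", broken EOL
sample = (
    codecs.BOM_UTF16_LE
    + "A|B".encode("utf-16-le")
    + BROKEN_EOL
    + "31".encode("utf-16-le")
    + BROKEN_EOL
)
print(repr(repair_utf16le(sample)))
```

For the real file you would read the bytes with open(path, "rb"), pass them through the function, and write the result back out in whatever encoding the File Adapter is configured for. Check the output against the original before relying on it, since a plain byte replacement does not verify 2-byte alignment.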