How to handle special characters (ഀ) in a file

Problem description

I have a file that looks like this when I open it in Notepad++:

         A|B|C|D|E਍ഀ
        31|HB|39|Ph|49਍ഀ
        32|FB|38|Ph|59਍ഀ

When I open it from WinSCP, it looks like this:

        ÿþA|B|C|D|E

         31|HB|39|Ph|49

         32|FB|38|Ph|59

I want to read this file with the BPEL File Adapter, but I cannot because the {eol} is not recognized. I also tried using ਍ഀ as the end-of-line marker, with no luck.

Screenshot: <http://i.stack.imgur.com/Rc8B8.png>

Thanks in advance,

Abhishek

Recommended answer

When you ran od -c on this file (see the comments above), you found:

0000000 377 376 A \0 | \0 B \0 | \0 C \0 | \0 D \0 
0000020 | \0 E \0 \r \n \0 \r \n \0 3 \0 1 \0 | \0 
0000040 H \0 B \0 | \0 3 \0 9 \0 | \0 P \0 h \0 
0000060 | \0 4 \0 9 \0 \r \n \0 \r \n \0 3 \0 2 \0 
0000100 | \0 F \0 B \0 | \0 3 \0 8 \0 | \0 P \0 
0000120 h \0 | \0 5 \0 9 \0 \r \n \0 \r \n \0 
0000136

Let's start from the beginning.

Did you notice the first two bytes? Octal 377 and 376, that is, 0xFF 0xFE in hex. This is the so-called Byte Order Mark (BOM). It tells the application reading the file which encoding and endianness to expect.

Now, a BOM of 0xFF 0xFE means the file contains Unicode characters encoded as UTF-16. To be precise, it is a little-endian UTF-16 file (UTF-16LE).
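
As a quick check, the BOM can also be inspected programmatically. The sketch below uses Python, which the original post does not mention, and the helper name `detect_bom` is made up for illustration. It only looks at the leading bytes and makes no attempt to guess the encoding of files without a BOM.

```python
import codecs

# Known BOMs, longest first so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# First bytes of the od dump: 377 376 A \0 | \0 B \0
print(detect_bom(b"\xff\xfeA\x00|\x00B\x00"))  # utf-16-le
```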

As your file is encoded as UTF-16LE, every character in it takes two bytes:

  • the first character (Latin capital letter A) is represented by "A \0"
  • the second character is "| \0"
  • the third character is "B \0"
  • and so on...
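
To see this byte layout for yourself, a few characters can be encoded by hand. This is only an illustrative Python snippet; the string "A|B|C" is the start of the header row from the question.

```python
header = "A|B|C"

# Naming the endianness explicitly means Python does not prepend a BOM.
encoded = header.encode("utf-16-le")

print(encoded)       # b'A\x00|\x00B\x00|\x00C\x00'
print(len(encoded))  # 10: two bytes for each of the five characters
```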

So far, so good. The problem is "\r \n". This sequence would be the usual CR LF if the file were encoded as UTF-8, but the BOM says the file is UTF-16LE, so:

  • carriage return should be represented by 0x0D 0x00 (\r \0 in od -c)
  • line feed by 0x0A 0x00 (\n \0 in od -c)
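
The difference can be checked with a short Python sketch. The `bad` bytes are copied from the terminator in the od dump; everything else is standard library behaviour.

```python
# What a correct CR LF looks like in UTF-16LE.
good = "\r\n".encode("utf-16-le")
print(good)  # b'\r\x00\n\x00'

# The terminator bytes actually present in the od dump: \r \n \0 \r \n \0
bad = b"\r\n\x00\r\n\x00"
decoded = bad.decode("utf-16-le")

# Show which code points a UTF-16LE reader sees instead of CR LF.
print([f"U+{ord(ch):04X}" for ch in decoded])  # ['U+0A0D', 'U+0D00', 'U+000A']
```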

The fact that you have "\r \n" instead of "\r \0 \n \0" confuses any application that interprets these bytes as UTF-16LE. Read as UTF-16LE, the byte pair 0x0D 0x0A is not CR LF but the single code unit U+0A0D, and the following 0x00 0x0D is U+0D00. Neither is a line terminator, so the application shows two "boxes" (or stray glyphs such as ਍ഀ) in their place.

To keep it short: your file is badly encoded (half UTF-16LE and half UTF-8). My guess is that someone edited it with Notepad or a similar tool.

You can try using iconv and/or sed to fix it.
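
As an alternative to iconv/sed, here is a rough Python sketch of a byte-level repair. It is not the method from the answer. It assumes every line ends with exactly the six bytes seen in the od dump (\r \n \0 \r \n \0) and that this byte run never occurs inside the data itself. The function name `repair_utf16le` is hypothetical.

```python
import codecs

BAD_EOL = b"\r\n\x00\r\n\x00"          # terminator as it appears in the od dump
GOOD_EOL = "\r\n".encode("utf-16-le")  # b'\r\x00\n\x00'


def repair_utf16le(raw: bytes) -> str:
    """Replace the mangled line endings and return the decoded text."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]  # drop the BOM before decoding
    fixed = raw.replace(BAD_EOL, GOOD_EOL)
    return fixed.decode("utf-16-le")


# Rebuild the bytes shown in the od dump to try the function out.
rows = ["A|B|C|D|E", "31|HB|39|Ph|49", "32|FB|38|Ph|59"]
sample = codecs.BOM_UTF16_LE + b"".join(
    row.encode("utf-16-le") + BAD_EOL for row in rows
)

text = repair_utf16le(sample)
print(text.splitlines())  # ['A|B|C|D|E', '31|HB|39|Ph|49', '32|FB|38|Ph|59']
```

For a real file, the bytes would come from reading the file in binary mode, and the repaired text could then be written back in whatever encoding the File Adapter is configured to expect. If the terminator in your file differs from the dump, `BAD_EOL` has to be adjusted accordingly.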
