Removing binary control characters from a text file

Question

I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command strips the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more behave this way in OSes earlier than Windows 8?

Recommended answer

Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.
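To make that distinction concrete, here is a minimal Python sketch (not part of the original answer) that removes control characters while keeping tabs, line feeds and carriage returns. The function name `strip_control_chars` and the exact set of characters treated as unwanted are assumptions chosen for illustration.

```python
import re

# Matches the C0 control characters and DEL, except tab (\x09),
# line feed (\x0a) and carriage return (\x0d), which are kept.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Remove control characters but keep tabs and line endings."""
    return _CONTROL_CHARS.sub("", text)


# Sample containing a NUL byte (^@), a tab, a CR/LF pair and a BEL character.
sample = "a\x00b\tc\r\nd\x07e\n"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

Whether carriage returns should also be dropped depends on what the downstream script expects; the sketch leaves them alone.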

I'm assuming your ^M is a carriage return, and ^@ is a NULL byte. The carriage returns are not causing you problems, and MORE does not remove them. But NULL bytes can cause problems if your utility is expecting ASCII text files.

Your input file is most likely UTF-16. MORE is converting the UTF-16 into ANSI (extended ASCII) format, which effectively removes the NULL bytes. It also converts non-ASCII values into extended ASCII characters in the decimal 128-255 byte value range. I believe it uses your active code page (CHCP) value to figure out which characters map where, but I'm not positive.
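The effect described above can be reproduced outside of MORE. The following Python sketch (an illustration, not what MORE does internally) shows why ASCII text stored as UTF-16 is full of NULL bytes and how re-encoding to a single-byte code page removes them. The helper name `utf16_to_ansi` and the choice of `cp1252` as the code page are assumptions; the active code page on a given machine may differ.

```python
import codecs


def utf16_to_ansi(data: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes and re-encode them in a single-byte code page.

    Characters that the code page cannot represent become "?".
    """
    text = data.decode("utf-16")  # the leading BOM selects the byte order
    return text.encode(codepage, errors="replace")


# "Hi é" as UTF-16 little endian with a BOM: every ASCII letter is
# followed by a zero byte, which shows up as ^@ in many editors.
raw = codecs.BOM_UTF16_LE + "Hi é".encode("utf-16-le")
converted = utf16_to_ansi(raw)

print(raw)        # contains \x00 after each character
print(converted)  # no \x00 bytes; é is a single byte above 127
```

In this sketch the BOM is consumed by the decoder, so it does not appear in the output either.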

You should be aware of some additional issues.

  • MORE will convert all Tab characters into a series of spaces, and you cannot control how many spaces (it varies depending on the current position in the line).
  • MORE will always terminate each line with \r\n (carriage return and line feed).
  • MORE also removes the two-byte BOM at the beginning of the file, if it exists. The BOM indicates the UTF-16 format. But MORE does not require the two-byte BOM indicator; it will convert the UTF-16 to ANSI regardless.
  • Lastly, MORE can hang indefinitely if your file exceeds 64K lines.
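The first point, that the number of spaces depends on the position in the line, is how tab stops work in general. As a rough illustration in Python (using `str.expandtabs`, and assuming tab stops every 8 columns, which may not match MORE's actual setting):

```python
def spaces_inserted(line: str, tabsize: int = 8) -> int:
    """Return how many characters a line grows by when tabs are expanded."""
    return len(line.expandtabs(tabsize)) - len(line)


short_prefix = "a\tb"     # tab starts at column 1
long_prefix = "abcd\tb"   # tab starts at column 4

print(repr(short_prefix.expandtabs(8)))
print(repr(long_prefix.expandtabs(8)))
```

Both lines contain exactly one tab, yet the first expands to seven spaces and the second to four, because each tab only pads up to the next tab stop. Once expanded, the original tabs cannot be recovered reliably.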

If MORE works for you, then by all means use it.

One other option is to use TYPE, which will also convert UTF-16 to ANSI:

type "yourFile.txt" >"newFile.txt"

TYPE definitely maps non-ASCII codes based on the active code page.

There are some differences between how TYPE and MORE convert:

  • One advantage of TYPE is that it does not convert Tab characters to spaces.
  • Another advantage is that it will not hang with large files.
  • Another difference (maybe good, maybe bad) is that it will not add a line terminator to a line that does not already have one.
  • A potential disadvantage of TYPE is that it will not convert UTF-16 to ANSI if the input is missing the BOM.
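Since TYPE depends on the BOM, it can help to check for one before choosing a tool. Below is a small Python sketch of such a check; the function name `detect_utf16_bom` is hypothetical, and the sketch only inspects the first two bytes of the data.

```python
import codecs
from typing import Optional


def detect_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order indicated by a leading BOM, or None.

    Note: a UTF-32 little-endian BOM also starts with FF FE, so this
    check alone cannot tell the two apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None


print(detect_utf16_bom(b"\xff\xfeH\x00i\x00"))  # BOM present
print(detect_utf16_bom(b"H\x00i\x00"))          # UTF-16 data without a BOM
```

To check a real file, read its first few bytes in binary mode, for example `open(path, "rb").read(2)`, and pass them to the function.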
