Stripping hex bytes with sed - no match

Question

I have a text file with two non-ascii bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
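
To reproduce the problem, a file with these exact bytes can be recreated roughly as follows. This is only a sketch: it assumes a POSIX-style shell with `printf` and `od`, and it writes the two stray bytes as octal escapes (`\377` is 0xFF, `\376` is 0xFE) because `\xHH` escapes are not portable in `printf`.

```shell
# Recreate the sample file from the hex dump above.
# \377 = 0xFF and \376 = 0xFE; there is no trailing newline,
# which matches the dump (it ends in 44 45 46, not 0A).
printf '\377\37658832520.3,ABC\n348384,DEF' > test.csv

# Dump the file byte by byte in hex to confirm its contents.
od -An -tx1 test.csv
```

The dump should begin with `ff fe`, followed by the ASCII codes of the first line.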

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9,]//g' test.csv 
??.

Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so logically one of them has to filter out these bytes, right? Why does neither of them match the 0xFF and 0xFE bytes?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.

Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.

Recommended answer

sed 's/[^ -~]//g'

or as the other answer implies

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter titled "Escapes".

Edit: on OS X, the native LANG setting is en_US.UTF-8.

Try:

LANG='' sed 's/[^ -~]//g' myfile

This works on an OS X machine here. I'm not entirely sure why it does not work when the locale is UTF-8.
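
One plausible explanation, offered here as an assumption rather than something the answer confirms: in a UTF-8 locale sed decodes its input as multibyte characters, and 0xFF and 0xFE can never occur in valid UTF-8, so they are not treated as characters at all and neither a bracket expression nor its negation matches them. Forcing the C locale makes every byte count as one character. The sketch below uses `LC_ALL=C` instead of `LANG=''` because `LC_ALL` takes precedence over `LANG` and the other `LC_*` variables; the output file names `clean_sed.csv` and `clean_tr.csv` are made up for illustration, and the sample input gets a trailing newline so that different sed implementations produce identical output.

```shell
# Sample input from the question: 0xFF 0xFE followed by ASCII text
# (octal escapes: \377 = 0xFF, \376 = 0xFE).
printf '\377\37658832520.3,ABC\n348384,DEF\n' > test.csv

# In the C locale each byte is a single character, so the negated
# class [^ -~] (anything outside printable ASCII) matches the two
# stray bytes and removes them.
LC_ALL=C sed 's/[^ -~]//g' test.csv > clean_sed.csv

# Alternative without regular expressions: delete the two bytes
# by their octal values with tr.
LC_ALL=C tr -d '\377\376' < test.csv > clean_tr.csv

cat clean_sed.csv
```

Note that the sed variant also drops any other non-printable or non-ASCII byte (tabs included), while the `tr` variant removes only 0xFF and 0xFE, so the choice depends on what else the file may legitimately contain.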
