Stripping hex bytes with sed - no match


Problem description


I have a text file with two non-ASCII bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
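
To reproduce the sample, the sketch below writes the same 27 bytes with printf octal escapes and dumps them with od. The file name sample.csv is hypothetical and used only for illustration; the original post does not say how its hex dump was produced.

```shell
# Write the bytes from the hex dump above; \377 and \376 are octal for 0xFF and 0xFE.
# There is no trailing newline, matching the dump.
printf '\377\37658832520.3,ABC\n348384,DEF' > sample.csv

# Dump the file as hex bytes, without the offset column.
od -An -tx1 sample.csv
```

Octal escapes are used because they are part of POSIX printf, whereas \x escapes are not supported by every printf implementation.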

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9\,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9\,]//g' test.csv 
??.

Main question: How do I strip these bytes?

Bonus question: The two regexes above are direct negations of each other, so one of them logically has to filter out these bytes, right? Why does neither regex match the 0xFF and 0xFE bytes?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.
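
One plausible explanation for the missing "5" and "3", offered as an assumption and not confirmed anywhere in the thread: if the sed in use does not interpret \x escapes, the bracket expression [\x80-\xff] is read literally and contains the range from "0" to "\" (0x30 to 0x5C), which covers digits and uppercase letters. The sketch below uses [0-Z] as a portable stand-in for that range, on ASCII-only input, to show the same effect.

```shell
# [0-Z] stands in for the range that a literal reading of [\x80-\xff] would contain.
# Without the g flag, only the first matching character on each line is deleted.
out=$(printf '58832520.3,ABC\n348384,DEF\n' | LC_ALL=C sed 's/[0-Z]//')
printf '%s\n' "$out"
# prints:
# 8832520.3,ABC
# 48384,DEF
```

This matches the output shown in the update, where the leading digit of each line disappears while the 0xFF and 0xFE bytes stay in place.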

Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.

Solution

sed 's/[^ -~]//g'

or, as the other answer suggests,

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter entitled "Escapes".

Edit: on OSX, the native LANG setting is en_US.UTF-8.

Try:

LANG='' sed 's/[^ -~]//g' myfile

This works on an OSX machine here; I'm not entirely sure why it does not work in a UTF-8 locale.
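
A likely reason, stated as an assumption that the answer itself does not confirm: 0xFF and 0xFE never occur in valid UTF-8, so in a UTF-8 locale sed may be unable to decode them as characters, and no bracket expression matches them. The sketch below forces the C locale so that input is treated as single bytes. It uses LC_ALL=C instead of the LANG='' shown above, because LC_ALL takes precedence over the other locale variables; the file names sample.csv and clean.csv are hypothetical.

```shell
# Build a sample with 0xFF 0xFE at the start (octal escapes are POSIX printf).
printf '\377\37658832520.3,ABC\n348384,DEF\n' > sample.csv

# In the C locale, delete every byte outside printable ASCII (space through tilde).
# Note: this also removes tabs and other control characters, not just high bytes.
LC_ALL=C sed 's/[^ -~]//g' sample.csv > clean.csv

cat clean.csv
```

Setting the variable on the command line limits the locale change to that single sed invocation, so the rest of the shell session keeps its UTF-8 settings.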
