Filtering out non-readable characters

Problem description

I have a file with binary and ascii characters in it. I massage the
data and convert it to a more readable format, however it still comes
up with some binary characters mixed in. I'd like to write something
to just replace all non-printable characters with '' (I want to delete
non-printable characters).

I am having trouble figuring out an easy python way to do this... is
the easiest way to just write some regular expression that does
something like replace [^\p] with ''?

Or is it better to go through every character and do ord(character),
check the ascii values?

What's the easiest way to do something like this?

thanks

Recommended answer

On 15 Jul 2005 17:33:39 -0700, "MKoool" <mo***@terabolic.com> wrote:
> I have a file with binary and ascii characters in it. I massage the
> data and convert it to a more readable format, however it still comes
> up with some binary characters mixed in. I'd like to write something
> to just replace all non-printable characters with '' (I want to delete
> non-printable characters).
>
> I am having trouble figuring out an easy python way to do this... is
> the easiest way to just write some regular expression that does
> something like replace [^\p] with ''?
>
> Or is it better to go through every character and do ord(character),
> check the ascii values?
>
> What's the easiest way to do something like this?


    >>> import string
    >>> string.printable
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
    >>> identity = ''.join([chr(i) for i in xrange(256)])
    >>> unprintable = ''.join([c for c in identity if c not in string.printable])
    >>>
    >>> def remove_unprintable(s):
    ...     return s.translate(identity, unprintable)
    ...
    >>> set(remove_unprintable(identity)) - set(string.printable)
    set([])
    >>> set(remove_unprintable(identity))
    set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\', '`',
    'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
    'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
    '6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~',
    '\t', '\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
    'e', 'i', 'm', 'q', 'u', 'y', '}'])
    >>> sorted(set(remove_unprintable(identity))) == sorted(set(string.printable))
    True
    >>> sorted((remove_unprintable(identity))) == sorted((string.printable))
    True

After that, to get clean file text, something like

    cleantext = remove_unprintable(file('unclean.txt').read())

should do it. Or you should be able to iterate by lines, something like (untested)

    for uncleanline in file('unclean.txt'):
        cleanline = remove_unprintable(uncleanline)
        # ... do whatever with clean line
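
The snippets above are Python 2 (xrange, file(), and the two-argument str.translate). For anyone on Python 3, here is a rough sketch of the same line-by-line idea on bytes. It assumes the file is read in binary mode; the clean_file helper and the output path are my own illustration, not part of the original answer.

```python
import string

# Every byte value that is NOT in string.printable (which is ASCII-only).
PRINTABLE_BYTES = string.printable.encode("ascii")
UNPRINTABLE_BYTES = bytes(b for b in range(256) if b not in PRINTABLE_BYTES)


def remove_unprintable(data: bytes) -> bytes:
    """Return data with every non-printable byte deleted."""
    # A table of None leaves the surviving bytes unchanged;
    # the second argument lists the bytes to delete.
    return data.translate(None, UNPRINTABLE_BYTES)


def clean_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path line by line, dropping non-printable bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for unclean_line in src:
            dst.write(remove_unprintable(unclean_line))
```

Something like clean_file('unclean.txt', 'clean.txt') would then write the filtered copy; reading in binary mode sidesteps decoding errors on the stray bytes.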

If there is something in string.printable that you don't want included, just use your own
string of printables. BTW,

    >>> help(str.translate)



    Help on method_descriptor:

    translate(...)
        S.translate(table [,deletechars]) -> string

        Return a copy of the string S, where all characters occurring
        in the optional argument deletechars are removed, and the
        remaining characters have been mapped through the given
        translation table, which must be a string of length 256.

Regards,
Bengt Richter
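
On the two alternatives raised in the question: as far as I know, Python's standard re module has no \p character classes, so [^\p] will not work as written, but an explicit character class does the job. Below is a minimal Python 3 sketch of both the regex route and the per-character route. The function names are made up for illustration, and "printable" is assumed to mean exactly the characters in string.printable.

```python
import re
import string

# Regex route: match anything outside printable ASCII (0x20-0x7e) and the
# five whitespace characters that string.printable also contains.
_UNPRINTABLE_RE = re.compile(r"[^\x20-\x7e\t\n\r\x0b\x0c]")


def remove_unprintable_re(text: str) -> str:
    """Delete non-printable characters with a single regex substitution."""
    return _UNPRINTABLE_RE.sub("", text)


# Per-character route: keep a character only if it is in string.printable.
_PRINTABLE = frozenset(string.printable)


def remove_unprintable_filter(text: str) -> str:
    """Delete non-printable characters by checking each one in turn."""
    return "".join(ch for ch in text if ch in _PRINTABLE)
```

Both functions should give the same result on any input; which one is faster will depend on the data, so it is worth timing them on a real file before choosing.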

