Remove '\x' from string in a text file in Python


Problem description

This is my first time posting on Stack. I would really appreciate it if someone could help me with this.

I’m trying to remove Unicode characters (\x3a in my case) from a text file containing the following:

10\x3a00\x3a00

The final output is supposed to be:

100000

Basically, we are being instructed to delete all traces of \xXX where X can be any of the following: 0123456789ABCDEF. I tried using regular expressions as follows to delete any \xXX.

re.sub('\\\x[a-fA-F0-9]{2}', "", a)

Where "a" is a line of a text file.

When I try that, I get an error saying "invalid \x escape".

I’ve been struggling with this for hours. What’s wrong with my regular expression?

Solution

The character "\x3a" is not a multi-byte Unicode character. It is the ASCII character ":". When Python parses the string literal "\x3a", it resolves the escape immediately and stores the single character ":", so the resulting string contains no backslash at all. You therefore cannot strip out "\x3a" as a multi-byte Unicode sequence: all Python sees is the single-byte ASCII character ":".

$ python
>>> '\x3a' == ':'
True
>>> "10\x3a00\x3a00" == "10:00:00"
True

See the Description section of the Wikipedia article on UTF-8: characters in the range U+0000-U+007F are each encoded as a single byte, identical to their ASCII encoding.

If you want to strip non-ASCII characters, do the following:

>>> print u'R\xe9n\xe9'
Réné
>>> ''.join([x for x in u'R\xe9n\xe9' if ord(x) < 127])
u'Rn'
>>> ''.join([x for x in 'Réné' if ord(x) < 127])
'Rn'

If you want to retain European characters but discard Unicode characters with higher code points, then change the 127 in ord(x) < 127 to some higher value.
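
The session above uses Python 2 syntax. As a rough sketch of the same filter in Python 3, with the cutoff exposed as a parameter (the function name keep_below and its default are illustrative, not from the original answer). Note that ASCII covers code points 0 through 127, so a limit of 128 keeps every ASCII character, whereas the < 127 test above also drops DEL (code point 127):

```python
def keep_below(text, limit=128):
    """Keep only the characters whose code point is below `limit`."""
    return ''.join(ch for ch in text if ord(ch) < limit)

# Default limit: everything outside ASCII is dropped.
print(keep_below('R\xe9n\xe9'))

# Raising the limit to 0x100 keeps Latin-1 characters such as U+00E9.
print(keep_below('R\xe9n\xe9', 0x100))
```

Raising the limit to 0x100 is only one possible choice; the right cutoff depends on which characters the data is expected to contain.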

The post "replace 3 byte unicode" takes another approach. You can also strip out whole code point ranges with a regular expression:

>>> s = u'[\uE000-\uFFFF]'
>>> len(s)
5
>>> import re
>>> pattern = re.compile(u'[\uE000-\uFFFF]', re.UNICODE)
>>> pattern.sub('?', u'ab\uFFFDcd')
u'ab?cd'

Notice that working with \u may be easier than working with \x for specifying characters.
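
For reference, a minimal Python 3 sketch of the same range-based substitution, where every string is already Unicode and the u prefix is unnecessary. The names HIGH_BMP and replace_range are hypothetical and chosen only for this illustration:

```python
import re

# Character class covering U+E000 through U+FFFF. The \u escapes are resolved
# by Python, so the compiled pattern contains the two boundary characters.
HIGH_BMP = re.compile('[\ue000-\uffff]')

def replace_range(text, replacement='?'):
    """Replace every character in U+E000-U+FFFF with `replacement`."""
    return HIGH_BMP.sub(replacement, text)

print(replace_range('ab\ufffdcd'))      # substitutes a '?'
print(replace_range('ab\ufffdcd', ''))  # removes the character instead
```

Passing an empty replacement deletes the matched characters rather than marking where they were.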

On the other hand, your string may contain the literal text "\\x3a" (a backslash followed by "x3a"), and that you can strip out. Of course, that string is not a multi-byte Unicode character either; it is 4 ASCII characters.

$ python
>>> print '\\x3a'
\x3a
>>> '\\x3a' == ':'
False
>>> '\\x3a' == '\\' + 'x3a'
True
>>> (len('\x3a'), len('\\x3a'))
(1, 4)
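
Applied to the original question: if the line really holds the four literal characters backslash, x, 3, a, a regular expression can remove them. The pattern in the question fails because, in an ordinary string literal, '\\\x[a...' is read as an escaped backslash followed by a \x escape whose next two characters are not hex digits, hence "invalid \x escape". Below is a minimal sketch in Python 3 syntax that sidesteps this with a raw string. The names HEX_ESCAPE, strip_hex_escapes and decode_hex_escapes are illustrative and not from the original post:

```python
import re

# Raw string: the regex engine receives \\x, meaning "a literal backslash, then x",
# followed by exactly two hexadecimal digits captured in group 1.
HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})')

def strip_hex_escapes(text):
    """Delete every literal \\xXX sequence from the text."""
    return HEX_ESCAPE.sub('', text)

def decode_hex_escapes(text):
    """Alternative: turn each literal \\xXX sequence into the character it names."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

line = '10\\x3a00\\x3a00'        # 14 characters, including two backslashes
print(strip_hex_escapes(line))   # 100000
print(decode_hex_escapes(line))  # 10:00:00
```

The decoding variant is included only for contrast, in case the sequences should be interpreted rather than discarded.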

You can also strip out the ASCII character ":":

>>> "10:00:00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace("\x3a", "")
'100000'
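
The replace calls above work on strings that already contain ":". A line read from a text file whose content is the literal text 10\x3a00\x3a00 contains backslashes instead, so replace(":", "") would leave it unchanged. The following is a hedged Python 3 sketch of the file-based case. The file name times.txt, the helper clean_file and the assumption that the file is plain ASCII are all made up for the example:

```python
import os
import re
import tempfile

# Literal backslash, 'x', then two hexadecimal digits.
HEX_ESCAPE = re.compile(r'\\x[0-9A-Fa-f]{2}')

def clean_file(path):
    """Return the lines of `path` with every literal \\xXX sequence removed."""
    with open(path, encoding='ascii') as handle:
        return [HEX_ESCAPE.sub('', line.rstrip('\n')) for line in handle]

# Write a throwaway sample file whose single line is the text 10\x3a00\x3a00.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'times.txt')
    with open(sample, 'w', encoding='ascii') as handle:
        handle.write('10\\x3a00\\x3a00\n')
    cleaned = clean_file(sample)

print(cleaned)
```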
