Remove '\x' from string in a text file in Python
This is my first time posting on Stack. I would really appreciate it if someone could assist me with this.
I’m trying to remove Unicode characters (\x3a in my case) from a text file containing the following:
10\x3a00\x3a00
The final output is supposed to be:
100000
Basically, we are being instructed to delete all traces of \xXX, where X can be any of the following: 0123456789ABCDEF. I tried using a regular expression as follows to delete any \xXX.
re.sub('\\\x[a-fA-F0-9]{2}', "", a)
Where a is a line of the text file.
When I try that, I get an error saying "invalid \x escape".
I’ve been struggling with this for hours. What’s wrong with my regular expression?
The character "\x3a" is not a multi-byte Unicode character. It is the ASCII character ":". Once you have written the string literal "\x3a", it is stored internally as the single character ":". Python never sees a backslash in the stored string. So you can't strip out "\x3a" as a multi-byte Unicode character, because Python only sees the single-byte ASCII character ":".
$ python
>>> '\x3a' == ':'
True
>>> "10\x3a00\x3a00" == "10:00:00"
True
Check out the description section of the Wikipedia article on UTF-8. Note that each character in the range U+0000-U+007F is encoded as a single byte, identical to its ASCII encoding.
If you want to strip non-ASCII characters, do the following:
>>> print u'R\xe9n\xe9'
Réné
>>> ''.join([x for x in u'R\xe9n\xe9' if ord(x) < 127])
u'Rn'
>>> ''.join([x for x in 'Réné' if ord(x) < 127])
'Rn'
If you want to retain European characters but discard Unicode characters with higher code points, then change the 127 in ord(x) < 127 to some higher value.
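The transcript above is Python 2. As a rough Python 3 sketch of the same idea, the cutoff can be made a parameter. The function name strip_above and its limit argument are hypothetical, not from the original answer, and the default of 128 is an assumption chosen to keep the whole ASCII range (code points 0 to 127).

```python
def strip_above(text, limit=128):
    """Return text with every character whose code point is >= limit removed.

    limit=128 keeps only ASCII; limit=256 also keeps Latin-1 characters
    such as the accented e used in the example above.
    """
    return "".join(ch for ch in text if ord(ch) < limit)


# ASCII only: the two accented characters are dropped.
print(strip_above("R\xe9n\xe9"))

# Keep Latin-1, drop the CJK character U+4E2D.
print(strip_above("R\xe9n\xe9 \u4e2d", limit=256))
```

A higher limit trades strictness for coverage, so pick it from the highest code point you actually want to keep.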
The post "replace 3 byte unicode" has another approach. You can also strip out code point ranges with:
>>> str = u'[\uE000-\uFFFF]'
>>> len(str)
5
>>> import re
>>> pattern = re.compile(u'[\uE000-\uFFFF]', re.UNICODE)
>>> pattern.sub('?', u'ab\uFFFDcd')
u'ab?cd'
Notice that working with \u may be easier than working with \x for specifying characters.
On the other hand, you could have the string "\\x3a", which you could strip out. Of course, that string isn't actually a multi-byte Unicode character but rather 4 ASCII characters.
$ python
>>> print '\\x3a'
\x3a
>>> '\\x3a' == ':'
False
>>> '\\x3a' == '\\' + 'x3a'
True
>>> (len('\x3a'), len('\\x3a'))
(1, 4)
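If the text really does contain those four literal characters (a backslash, x, 3, a), a regular expression can remove them. The pattern in the question fails before the regex engine ever runs: in a normal string literal, '\\\x[' is read as an escaped backslash followed by an \x escape whose next two characters are not hex digits, hence "invalid \x escape". A minimal Python 3 sketch, assuming a raw-string pattern is acceptable; the name HEX_ESCAPE is illustrative and not from the original post:

```python
import re

# Raw string: the regex engine receives \\x, which matches one literal
# backslash followed by the letter x, then exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{2}")

# A raw literal so the backslashes stay in the string, as they would
# when the line is read from a file.
line = r"10\x3a00\x3a00"

cleaned = HEX_ESCAPE.sub("", line)
print(len(line), cleaned)
```

Without the r prefix, the same pattern would have to be written with four backslashes ('\\\\x[0-9A-Fa-f]{2}'), which is why raw strings are the usual choice for regular expressions.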
You can also strip out the ASCII character ":":
>>> "10:00:00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace("\x3a", "")
'100000'
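For the file-based case in the question, here is a rough Python 3 sketch under the assumption that the file holds literal backslash sequences, as the sample line 10\x3a00\x3a00 suggests. The helper strip_hex_escapes and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import os
import re
import tempfile

# Matches a literal backslash, the letter x, and two hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{2}")


def strip_hex_escapes(path):
    """Read a text file and return its lines with every \\xXX sequence removed."""
    with open(path, encoding="utf-8") as handle:
        return [HEX_ESCAPE.sub("", line.rstrip("\n")) for line in handle]


# Create a throwaway file containing the sample line from the question.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("10\\x3a00\\x3a00\n")
    sample_path = tmp.name

result = strip_hex_escapes(sample_path)
os.remove(sample_path)
print(result)
```

If the file instead contained real ":" characters, the replace(":", "") calls shown above would be the simpler tool.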