Python regular expression with UTF-8 issue
Question
I have a file containing many lines of plain UTF-8 text. A sample line is shown below (the text is Chinese).
PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08
The file itself is saved as UTF-8, and its name is xx.txt.
Here is my Python code; the environment is Python 2.7.
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()
The problem is that I get no results.
I want to extract the decimal string from 交易金额:0.01元, which in this case is 0.01.
Why doesn't this code work? Can anyone explain it to me? I have no clue.
Recommended answer
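There are several issues with your code. First, compile the pattern from a unicode literal, re.compile(ur'<unicode string>'), so that the pattern and the decoded line are both unicode objects; with a plain byte-string pattern, the Chinese characters are compared as raw UTF-8 bytes and never match the decoded text. Adding the re.UNICODE flag is also good practice (though I am not sure it is really needed here). Second, pattern.match() only matches at the start of the line, and 交易金额 appears in the middle of it, so use pattern.search() instead. Third, even then you would not get a match, because \d+ matches only a run of digits and not a decimal point; use \d+\.?\d+ instead (digits, an optional dot, then more digits). Note that this pattern requires at least two digits, so a single-digit amount such as 5 would not match; \d+(?:\.\d+)? handles both cases. Example code: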
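#coding: utf-8
import re

# Sample line from the question, as a unicode literal
text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08"
# Unicode pattern; the group captures the amount, including the decimal part
pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE)
# search() scans the whole line; group(1) is the captured amount
print pattern.search(text).group(1)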
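For readers on Python 3, the same idea might look like the sketch below. It is only an illustration under a few assumptions: string literals are already Unicode in Python 3, so no ur prefix is needed (and ur'' is not valid syntax there); the names AMOUNT_RE, extract_amount and amounts_in_file are hypothetical and not part of the original post; and the pattern \d+(?:\.\d+)? is a swapped-in variant of the answer's pattern that also accepts whole amounts such as 5.

```python
# -*- coding: utf-8 -*-
import io
import re

# Hypothetical pattern: digits with an optional fractional part, so both
# "0.01" and a whole amount such as "5" are captured in group 1.
AMOUNT_RE = re.compile(r'交易金额:(\d+(?:\.\d+)?)元')


def extract_amount(line):
    """Return the amount found in an already-decoded line, or None."""
    # search() scans the whole line; match() would only try position 0.
    found = AMOUNT_RE.search(line)
    return found.group(1) if found else None


def amounts_in_file(path):
    """Collect the amounts from every line of a UTF-8 encoded file."""
    # io.open decodes each line while reading, so no .decode() call is needed.
    with io.open(path, encoding='utf-8') as handle:
        return [a for a in map(extract_amount, handle) if a is not None]


sample = ("PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 "
          "订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08")

print(extract_amount(sample))   # 0.01
print(AMOUNT_RE.match(sample))  # None, because the line starts with "PROCESS"
```

Calling amounts_in_file('xx.txt') would then return the list of amounts found in the file from the question, assuming that file exists and is saved as UTF-8.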