Python Unicode Regular Expression

Question

I am using Python 2.4 and I am having some problems with Unicode regular expressions. I have tried to put together a very clear and concise example of my problem. It looks as though there is some problem with how Python is recognizing the different character encodings, or a problem with my understanding. Thank you very much for taking a look!

#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help! 

import urllib # To get files off the Internet
import chardet # To identify character encodings
import re # Python Regular Expressions 
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which REs I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded)) # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''

I am working on a substitution project, and am having a difficult time with the non-ASCII encoded files. This problem is part of a bigger project - eventually I would like to substitute the text with other text (I got this working in ASCII, but I can't identify occurrences in other encodings yet.) Thanks again.

http://brian-stinar.blogspot.com

-Brian J. Stinar-


Recommended answer

You probably want to either enable the DOTALL flag or use the search method instead of the match method, i.e.:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

Or:

# search will find matches even if they aren't at the start of the string
... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you matches. (See which one is the type you want.)
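
To make the difference concrete, here is a minimal sketch. The `sample` string is a made-up stand-in, not the page from the question, and the code is written for Python 3; the behavior of `match`, `search`, and `DOTALL` that it illustrates is the same in Python 2.

```python
import re

# Hypothetical stand-in for the downloaded HTML: the keyword is not on the first line.
sample = "<html>\n<p>UNSUBSCRIBE here</p>\n<p>Adobe</p>\n</html>"

pattern = ".*UNSUBSCRIBE.*"

# match() only tries position 0, and "." stops at a newline,
# so ".*" can never get past the first line to reach the keyword.
plain_match = re.compile(pattern).match(sample)

# With DOTALL, "." also matches newlines, so the match spans the whole string.
dotall_match = re.compile(pattern, re.DOTALL).match(sample)

# search() looks for a match starting at any position; without DOTALL it
# returns only the line that contains the keyword.
search_match = re.compile(pattern).search(sample)

print(plain_match)                        # None
print(dotall_match.group(0) == sample)    # True
print(search_match.group(0))              # <p>UNSUBSCRIBE here</p>
```

Which variant fits depends on whether you want the whole document back (DOTALL) or just the line around the keyword (search).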

As an aside: You seem to be getting the encoded text (which is bytes) and decoded text (characters) confused. This isn't uncommon, especially in pre-3.x Python. In particular, this is very suspicious:

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You're de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string doesn't have an encoding anymore.)

The rest of your code is trying to do matches on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded unicode string instead.
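
A rough sketch of that workflow follows: decode once, match on the decoded text, and encode only on the way out. The `raw_bytes` value is invented for illustration and the code targets Python 3; in Python 2 the decoded value would be a `unicode` object and the pattern would typically be a `u"..."` literal.

```python
import re

# Hypothetical bytes standing in for the downloaded page.
# The escapes spell a Polish word with accented letters, followed by the keyword.
original_text = "Za\u017c\u00f3\u0142\u0107\nUNSUBSCRIBE\n"
raw_bytes = original_text.encode("iso-8859-2")

# Decode once, at the boundary, into text (characters).
decoded_text = raw_bytes.decode("iso-8859-2")

# Match against the decoded text, not against the encoded bytes.
found = re.search(r"UNSUBSCRIBE", decoded_text)

# With Unicode-aware matching, \w treats the accented letters as word characters.
first_word = re.match(r"\w+", decoded_text, re.UNICODE).group(0)

# Encode only when writing the result back out.
utf8_bytes = decoded_text.encode("utf-8")

print(found is not None)
print(first_word == "Za\u017c\u00f3\u0142\u0107")
```

Naming the variables after what they hold (`raw_bytes`, `decoded_text`, `utf8_bytes`) rather than after an encoding makes it harder to pass bytes to a pattern that expects text.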
