Python regex issue with Unicode (Japanese) characters

Question

I want to remove part of the string below (the first 加護亜依, which precedes "Kago Ai"). The string is stored in oldString:

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

I'm using the following regex in Python:

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)

When I print newString, nothing has been removed.

Recommended answer

You can use the following snippet to solve the issue:

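#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Python 2: the u prefix makes these unicode strings rather than byte strings.
old_string = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# A run of Japanese characters plus one space, matched only when Latin text and "–" follow.
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", old_string)
# Encode explicitly before printing.
print match.encode("UTF-8")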

See the IDEONE demo.

Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
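The sketch below, written for Python 3 (an assumption, since the answer targets Python 2), may help show why the original \W pattern removes nothing while the explicit character class works. With Unicode matching, kanji and kana count as word characters, so \W never matches them. The names JAPANESE, broken and fixed are hypothetical, and the block names in the comments are my reading of the Unicode charts, not part of the original answer.

```python
import re

# The ranges from the answer's character class, one per line.
JAPANESE = (
    "\u3000-\u303f"  # CJK symbols and punctuation
    "\u3040-\u309f"  # hiragana
    "\u30a0-\u30ff"  # katakana
    "\uff00-\uff9f"  # fullwidth forms and halfwidth katakana
    "\u4e00-\u9faf"  # CJK unified ideographs (common kanji)
    "\u3400-\u4dbf"  # CJK unified ideographs, extension A
)

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"

# Pattern from the question: \W cannot match kanji, which are word characters.
broken = re.compile(r"( [\W]+) (?=[A-Za-z ]+–)")
# Pattern from the answer: an explicit class of Japanese characters.
fixed = re.compile("[" + JAPANESE + "]+ (?=[A-Za-z ]+–)")

if __name__ == "__main__":
    print(broken.sub("", old_string) == old_string)  # True: nothing was removed
    print(fixed.sub("", old_string))  # [DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY
```

Only the first 加護亜依 is removed, because the lookahead requires Latin text followed by "–" right after the match.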

Note that match needs to be encoded as a UTF-8 string "manually", since Python 2 needs to be "reminded" that we are working with Unicode all the time.
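For comparison, here is a minimal sketch of the same substitution in Python 3, assuming a UTF-8 capable terminal. In Python 3 every str is already Unicode, so the u prefix, the re.U flag, and the manual encode step should not be needed. The function name remove_japanese_prefix is hypothetical.

```python
import re

# Same character class as in the answer; Python 3 str patterns are Unicode by default.
PATTERN = re.compile(
    "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff"
    "\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)"
)


def remove_japanese_prefix(text):
    """Drop a run of Japanese characters that precedes Latin text and an en dash."""
    return PATTERN.sub("", text)


if __name__ == "__main__":
    result = remove_japanese_prefix("[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY")
    # print() encodes the str using the encoding of standard output.
    print(result)
```

Calling result.encode("utf-8") here would return a bytes object, which print() would display in its b'...' form, so the Python 2 idiom should not be carried over as-is.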
