How to find Chinese or Japanese characters in a string in Python?

Views: 2170 Published: 2016/11/19 14:32:30 Tags: python string unicode utf-8 character-encoding


Problem description

For example:

    str = 'sdf344asfasf天地方益3権sdfsdf'

Wrap the Chinese and Japanese characters in parentheses:

    strAfterConvert = 'sdf344asfasf(天地方益)3(権)sdfsdf'

Solution

As a start, you can check whether the character is in one of the following Unicode blocks:

Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF

Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF

Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF

Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F

Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F
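As a rough illustration of the block check described above, the sketch below looks up a single character's code point with `ord` and reports which of the five listed blocks contains it. The names `CJK_BLOCKS` and `cjk_block_of` are made up for this example and are not part of the original answer.

```python
# Block names and bounds are the five blocks listed above.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B73F),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81F),
]


def cjk_block_of(char):
    """Return the name of the listed block containing char, or None."""
    code_point = ord(char)
    for name, first, last in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


if __name__ == "__main__":
    for ch in "a天権":
        print(ch, hex(ord(ch)), cjk_block_of(ch))
```

Returning the block name rather than a plain boolean makes it easier to see why a given character was or was not matched.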

After that, all you need to do is iterate through the string, checking if the char is Chinese, Japanese or Korean (CJK) and append accordingly:
    # -*- coding: utf-8 -*-
    # Written for Python 3, where str is already Unicode, so the u"" prefixes
    # and the .decode("utf-8") call of the Python 2 version are not needed.

    ranges = [
        {"from": ord("\u3300"), "to": ord("\u33ff")},          # CJK Compatibility
        {"from": ord("\ufe30"), "to": ord("\ufe4f")},          # CJK Compatibility Forms
        {"from": ord("\uf900"), "to": ord("\ufaff")},          # CJK Compatibility Ideographs
        {"from": ord("\U0002F800"), "to": ord("\U0002fa1f")},  # CJK Compatibility Ideographs Supplement
        {"from": ord("\u30a0"), "to": ord("\u30ff")},          # Japanese Kana (Katakana block; Hiragana, U+3040 to U+309F, is not listed)
        {"from": ord("\u2e80"), "to": ord("\u2eff")},          # CJK Radicals Supplement
        {"from": ord("\u4e00"), "to": ord("\u9fff")},          # CJK Unified Ideographs
        {"from": ord("\u3400"), "to": ord("\u4dbf")},          # Extension A
        {"from": ord("\U00020000"), "to": ord("\U0002a6df")},  # Extension B
        {"from": ord("\U0002a700"), "to": ord("\U0002b73f")},  # Extension C
        {"from": ord("\U0002b740"), "to": ord("\U0002b81f")},  # Extension D
        {"from": ord("\U0002b820"), "to": ord("\U0002ceaf")},  # Extension E, included as of Unicode 8.0
    ]


    def is_cjk(char):
        # True if the code point falls inside any of the ranges above.
        return any(r["from"] <= ord(char) <= r["to"] for r in ranges)


    def cjk_substrings(string):
        # Yield each maximal run of consecutive CJK characters.
        i = 0
        while i < len(string):
            if is_cjk(string[i]):
                start = i
                # The bounds check avoids an IndexError when the string ends with a CJK run.
                while i < len(string) and is_cjk(string[i]):
                    i += 1
                yield string[start:i]
            i += 1


    string = "sdf344asfasf天地方益3権sdfsdf"
    for sub in cjk_substrings(string):
        string = string.replace(sub, "(" + sub + ")")
    print(string)
The above prints:

    sdf344asfasf(天地方益)3(権)sdfsdf
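One thing to watch with the loop above is that `str.replace` rewrites every occurrence of a substring, so a CJK run that appears twice (for example in `天a天`) ends up wrapped more than once. A possible alternative, sketched here with Python 3's standard `re` module, wraps each run in a single pass. The names `CJK_RANGES`, `CJK_RUN` and `wrap_cjk` are illustrative and not from the original answer; the ranges are the same ones used in the code above.

```python
import re

# Same code point ranges as in the answer's list, as (first, last) pairs.
CJK_RANGES = [
    (0x3300, 0x33FF), (0xFE30, 0xFE4F), (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F), (0x30A0, 0x30FF), (0x2E80, 0x2EFF),
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
]

# A single character class that matches a run of one or more CJK characters.
CJK_RUN = re.compile(
    "[" + "".join("%s-%s" % (chr(lo), chr(hi)) for lo, hi in CJK_RANGES) + "]+"
)


def wrap_cjk(text):
    """Wrap every maximal run of CJK characters in parentheses."""
    # \g<0> refers to the whole match, so each run is wrapped exactly once.
    return CJK_RUN.sub(r"(\g<0>)", text)


if __name__ == "__main__":
    print(wrap_cjk("sdf344asfasf天地方益3権sdfsdf"))
```

Because the substitution works on match positions rather than on substring values, repeated runs are each wrapped exactly once.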
To be future-proof, you might want to keep a lookout for CJK Unified Ideographs Extension E. It will ship with Unicode 8.0, which is scheduled for release in June 2015. I've added it to the ranges, but you shouldn't include it until Unicode 8.0 is released.
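If you want to tie the Extension E range to the Unicode version, one option is to ask the interpreter which version of the Unicode Character Database it ships with, via `unicodedata.unidata_version` from the standard library. The sketch below assumes Python 3; the helper names `unicode_version_at_least` and `extension_e_available` are hypothetical.

```python
import unicodedata

# CJK Unified Ideographs Extension E, the last range in the answer's list.
EXTENSION_E = (0x2B820, 0x2CEAF)


def unicode_version_at_least(major, minor=0):
    """Compare the interpreter's Unicode database version with a minimum."""
    # unidata_version is a dotted string; compare the first two components.
    parts = tuple(int(p) for p in unicodedata.unidata_version.split("."))
    return parts[:2] >= (major, minor)


def extension_e_available():
    """True if the Unicode database is at version 8.0 or later."""
    return unicode_version_at_least(8, 0)


if __name__ == "__main__":
    print(unicodedata.unidata_version, extension_e_available())
```

Note that the plain `ord` comparison in `is_cjk` does not consult this database, so the check only serves as a guard for deciding whether to add the range.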

[EDIT]

Added CJK compatibility ideographs, Japanese Kana and CJK radicals.
