Python中unicode字符串的显示宽度 [英] Display width of unicode strings in Python

查看:53
本文介绍了Python中unicode字符串的显示宽度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何确定 Python 3.x 中 Unicode 字符串的显示宽度,有没有办法使用该信息将这些字符串与 str.format() 对齐?

激励示例: 将字符串表打印到控制台.一些字符串包含非 ASCII 字符.

<预><代码>>>>对于 d.keys() 中的标题:>>>打印("{:<20} | {}".format(title, d[title]))动物园- |动物园.zootekni- |佐泰克.动物园|动物园.zooveterinar- |动物园.zoovetinstitut- |动物园妈妈 |母母>>>s = 'è'>>>镜片)2>>>[ord(c) for c in s][101, 768]>>>unicodedata.name(s[1])'结合严重的口音'>>>s2 = '母'>>>镜头(s2)1

可以看出,str.format() 只是将字符串中的代码点数 (len(s)) 作为其宽度,导致输出中的倾斜列.搜索 unicodedata 模块,我没有找到任何建议解决方案的内容.

Unicode 规范化 可以解决 è 的问题,但不能解决通常具有较大显示宽度的亚洲字符的问题.类似地,存在零宽度 unicode 字符(例如,允许单词内的换行符的零宽度空间).您无法通过规范化来解决这些问题,因此请不要建议规范化您的字符串".

添加了有关规范化的信息.

编辑 2:在我的原始数据集中也有一些欧洲组合字符,即使在标准化后也不会产生单个代码点:

 zwemwater |茨威姆.zwia̢z- |zw.>>>s3 = 'a\u0322' # 来自 zwiaz 的 'a + 下面结合卷曲钩'>>>len(unicodedata.normalize('NFC', s3))2

解决方案

您有几个选择:

  1. 某些控制台支持转义序列以精确定位光标.不过可能会造成一些叠印.

    历史记录:Amiga 终端使用这种方法在控制台窗口中显示图像,方法是打印一行文本,然后将光标向下移动一个像素.文本行的剩余像素慢慢构建了一个图像.

  2. 在您的代码中创建一个表格,其中包含控制台/终端窗口中使用的字体中所有 Unicode 字符的实际(像素)宽度.使用一个 UI 框架和一个小的 Python 脚本来生成这个表.

    然后添加使用此表计算文本实际宽度的代码.但是,结果可能不是控制台中字符宽度的倍数.结合像素精确的光标移动,这可能会解决您的问题.

    注意:您必须为连字(fi、fl)和添加特殊处理复合物.或者,您可以在不打开窗口的情况下加载 UI 框架并使用图形基元来计算字符串宽度.

  3. 使用制表符 (\t) 进行缩进.但这只有在您的 shell 实际使用实际文本宽度来放置光标时才会有所帮助.许多终端只会计算字符数.

  4. 创建一个带有表格的 HTML 文件并在浏览器中查看.

How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?

Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.

>>> for title in d.keys():
>>>     print("{:<20} | {}".format(title, d[title]))

    zootehni-           | zooteh.
    zootekni-           | zootek.
    zoothèque          | zooth.
    zooveterinar-       | zoovet.
    zoovetinstitut-     | zoovetinst.
    母                   | 母母

>>> s = 'è'
>>> len(s)
    2
>>> [ord(c) for c in s]
    [101, 768]
>>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
    1

As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.

Unicode normalization can fix the problem for è, but not for Asian characters, which often have larger display width. Similarly, zero-width unicode characters exist (e.g. zero-width space for allowing line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".

Edit: Added info about normalization.

Edit 2: In my original dataset also have some European combining characters that don't result in a single code-point even after normalization:

    zwemwater     | zwemw.
    zwia̢z-       | zw.

>>> s3 = 'a\u0322'   # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
    2

解决方案

You have several options:

  1. Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.

    Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.

  2. Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.

    Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.

    Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.

  3. Use the tab character (\t) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.

  4. Create a HTML file with a table and look at it in a browser.

这篇关于Python中unicode字符串的显示宽度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆