Formatting columns containing non-ASCII characters

Question

I want to align fields containing non-ASCII characters. The following does not seem to work:

for word1, word2 in [['hello', 'world'], ['こんにちは', '世界']]:
    print "{:<20} {:<20}".format(word1, word2)

hello                world
こんにちは      世界

Is there a solution?

Solution

You are formatting a multi-byte encoded byte string. In Python 2, a plain string literal such as 'こんにちは' is a byte string, and you appear to be using UTF-8 to encode your text. That encoding uses multiple bytes per codepoint (between 1 and 4, depending on the specific character). Formatting a byte string counts bytes, not codepoints, which is one reason why your strings end up misaligned:

>>> len('hello')
5
>>> len('こんにちは')
15
>>> len(u'こんにちは')
5
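
The session above is Python 2. As a minimal sketch of the same distinction, assuming Python 3 (where every str is already Unicode), you can compare a string with its UTF-8 encoding:

```python
# Python 3 sketch (an assumption; the session above is Python 2).
# A str counts code points, a bytes object counts encoded bytes.
text = 'こんにちは'
encoded = text.encode('utf-8')  # each of these kana takes 3 bytes in UTF-8

print(len(text))     # number of code points
print(len(encoded))  # number of UTF-8 bytes
```

So on Python 3 the byte-counting problem should not appear unless you format bytes objects, but the width problem described next still applies.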

Format your text as Unicode strings instead, so that you can count codepoints, not bytes:

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{:<20} {:<20}".format(word1, word2)

Your next problem is that these characters are also wider than most; you have double-wide codepoints:

>>> import unicodedata
>>> unicodedata.east_asian_width(u'h')
'Na'
>>> unicodedata.east_asian_width(u'世')
'W'
>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{:<20} {:<20}".format(word1, word2)
...
hello                world
こんにちは                世界

str.format() is not equipped to deal with that issue; you'll have to manually adjust your column widths before formatting based on how many characters are registered as wider in the Unicode standard.
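
To see how many columns a string is likely to occupy, you could sum a per-character width. This is only a sketch, assuming Python 3 and a terminal that draws Wide and Fullwidth characters at exactly two columns; display_width is a hypothetical helper name, not part of the standard library:

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 2 for Wide/Fullwidth code points, 1 otherwise.

    Hypothetical helper; it ignores combining marks and control characters,
    which a real terminal would not draw at one column each.
    """
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in text)

print(display_width('hello'))
print(display_width('こんにちは'))
print(display_width('世界'))
```

Comparing this estimate with len() shows how far off the padding added by str.format() will be.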

This is tricky because there is more than one width available. See the East Asian Width Unicode standard annex; there are narrow, wide and ambiguous widths. Narrow is the width most other characters print at, and wide is double that on my terminal. Ambiguous is... ambiguous as to how wide it'll actually be displayed:

Ambiguous characters require additional information not contained in the character code to further resolve their width.

How they are displayed depends on the context; Greek characters, for example, are displayed as narrow characters in a Western text, but as wide characters in an East Asian context. My terminal displays them as narrow, but other terminals (configured for an East Asian locale, for example) may display them as wide instead. I'm not sure if there are any fool-proof ways of figuring out how that would work.
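
Since the width of ambiguous characters can't be derived from the character itself, one option is to make it a setting the caller chooses. A sketch, assuming Python 3; char_width and its ambiguous_wide flag are hypothetical names introduced for illustration:

```python
import unicodedata

def char_width(char, ambiguous_wide=False):
    """Guess the column width of one character.

    ambiguous_wide is a hypothetical switch: set it to True if your
    terminal draws Ambiguous ('A') characters at double width.
    """
    category = unicodedata.east_asian_width(char)
    if category in ('W', 'F'):
        return 2
    if category == 'A':
        return 2 if ambiguous_wide else 1
    return 1

print(unicodedata.east_asian_width('α'))  # Greek alpha is classed as Ambiguous
print(char_width('α'), char_width('α', ambiguous_wide=True))
```

This doesn't detect what the terminal does; it only moves the guess into one place where you can change it.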

For the most part, you need to count characters with a 'W' (Wide) or 'F' (Fullwidth) value for unicodedata.east_asian_width() as taking 2 positions; subtract 1 from your format width for each of these:

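import unicodedata

def calc_width(target, text):
    # Each Wide ('W') or Fullwidth ('F') character occupies two columns,
    # so shrink the format width by one for each of them.
    return target - sum(unicodedata.east_asian_width(c) in ('W', 'F') for c in text)

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20, word2))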

This then produces the desired alignment in my terminal:

>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20, word2))
...
hello                world
こんにちは           世界
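
If you'd rather pad the strings yourself than compute adjusted format widths, something along these lines may work. It is a sketch assuming Python 3 and the same "W and F count double" rule; ljust_display is a hypothetical helper name:

```python
import unicodedata

def ljust_display(text, width):
    """Left-justify text to `width` terminal columns, counting W/F as 2.

    Hypothetical helper; text already wider than `width` is returned unpadded.
    """
    extra = sum(unicodedata.east_asian_width(c) in ('W', 'F') for c in text)
    return text + ' ' * max(0, width - len(text) - extra)

for word1, word2 in [['hello', 'world'], ['こんにちは', '世界']]:
    print(ljust_display(word1, 20), ljust_display(word2, 20))
```

The padded strings are shorter in code points than 20 whenever wide characters are present, which is exactly what makes the columns line up on screen.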

The slight misalignment you may see above is caused by your browser or font using a different width ratio (not quite double) for the wide codepoints.

All this comes with a caveat: not all terminals support the East Asian Width Unicode property, and some display all codepoints at one width only.
